Digital surveillance is a leading-edge security topic and has been applied on an
unprecedented scale to monitor and protect our lives almost anywhere. It is ubiquitous
and practical and could greatly reduce the human labor of security staff. With the aid of
machine intelligence, surveillance systems are able to automatically complete tasks such
as object detection, recognition, and tracking, as well as the event life cycle, including
event detection, recognition, search, retrieval, mining, and reasoning. With the
development of artificial intelligence (AI), these systems have benefited from progress
in supercomputing, cloud computing, big data, deep learning, etc.

This book starts with camera calibration, along with surveillance data capturing,
scrambling and descrambling, and secure transmission in a secure network environment;
afterward, surveillance at the object level is introduced, including detection,
recognition, and tracking. Biometrics is presented as an important part of this book. A
picture is worth more than a thousand words, and an event is worth more than thousands
of pictures. An event is the basic unit of knowledge which bridges the gap between the
physical world and semantic objects; the life cycle of an event includes generating new
events and operations on events. The knowledge based on events can be used for discovery
and exploration, and this is employed for making surveillance alarms. At the end of this
book, the fundamentals of supercomputing (FPGA, GPU, and parallel computing), cloud
computing, mobile computing, deep learning, etc., are emphasized.

This book is based on our research and teaching experience; we have used its content
for postgraduate teaching in higher education. The material of the whole course,
including assignments and examinations, has been verified many times and should serve
the readers of this book well.

This book was written for research students and engineers as well as scientists 
who are interested in intelligent surveillance. 


Auckland, New Zealand Wei Qi Yan 
October 2018 


The first edition of this book was published in March 2016. In the past two years, we
have endeavored in all aspects to make the book complete and polished. Following the
feedback from readers and audiences, the author has further corrected mistakes and
typos, detailed the mathematical descriptions and algorithms, and updated each chapter
with the latest contents and references.

The second edition of this book was published in June 2017 and emphasized cutting-edge
technologies such as deep learning, mobile and cloud computing, and big data in
intelligent surveillance. The author’s endeavor was to show how surveillance research
and teaching could draw on the progress of other fields and what the audience and
readers could pay attention to when they read this book.

The third edition of this book emphasizes human behavior analysis, privacy
preservation, and the details of deep learning and artificial intelligence (AI). We
integrate the latest developments in AI and machine learning into this book to meet the
trends of today’s research. The book shows how machine intelligence could assist security
staff in surveillance with regard to the fundamental aspects: observation, learning,
presentation, and reasoning or references.
who have given invaluable comments on this book. Special thanks to my super-
Mr. D. Gu, Mr. J. Wang, Dr. Y. Zhang, Mr. L. Zhou, Mr. J. Lu, Mr. J. Shen,
Mr. D. Shen, Mr. K. Zheng, Ms. Y. Ren, Mr. R. Li, Mr. P. Li, Mr. Z. Liu,
Ms. Y. Shen, Ms. H. Wang, Mr. C. Xin, Ms. Q. Zhang, Ms. X. Zhang, Dr. Q. Gu,
Introduction


1.1 Introduction to Surveillance 


Falling costs coupled with the rapid miniaturization of video cameras have enabled
their widespread use on highways, at airports and railway stations, and in on-road
vehicles. Intelligent surveillance has become an essential and practical means for
security, combining computer science and engineering with multidisciplinary studies
including data repositories, computer vision [14,17], digital image processing, computer
graphics, and computational intelligence. Unlike traditional monitoring, scenes
related to security concerns can be monitored automatically and remotely with the
assistance of intelligent surveillance equipment.

Security in surveillance is closely related to public safety. The threats from
attacks have been stressed significantly by governments, and asset security is always
a crucial concern for all organizations. The demand for remote surveillance systems
for safety and security is mainly in areas such as transport applications, maritime
environments, public places, and military applications.

The basic equipment needed in intelligent surveillance consists of calibrated sensors
which are installed and linked under the control of a monitoring center. The ability to
monitor a broad district effectively facilitates security staff’s work and lessens the
number of guards required for monitoring. It is well known that an effective
surveillance system should greatly reduce security staff’s human labor.

The pipeline of a surveillance system includes moving object detection, recognition,
tracking, event analysis, database archiving, and alarm making, as shown in Fig. 1.1.

Fig. 1.1 Pipeline of a surveillance system

Moving object detection is the key step in visual surveillance, especially for
foreground and background separation. Background subtraction is sensitive to
foreground changes and performs better for each moving object, especially in
object tracking [16]. The conventional technique of video frame differencing subtracts
two consecutive frames and then applies thresholding and analyzes the gradient or
histogram; key points, as a typical feature of motion, have also been applied to
foreground and background separation. Nowadays, motion analysis and optical flow based
on information theory and deep learning have been employed for video dynamic
analysis [42] and can achieve surprisingly good results [9].
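
The frame differencing step described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: frames are plain Python lists of grayscale intensities, and the function names and the threshold value of 25 are hypothetical choices, not values taken from this book.

```python
def frame_difference(prev_frame, curr_frame, threshold=25):
    """Return a binary foreground mask from two grayscale frames.

    Frames are lists of rows of integer intensities (0-255); `threshold`
    is an illustrative value chosen for this sketch.
    """
    if len(prev_frame) != len(curr_frame):
        raise ValueError("frames must have the same height")
    mask = []
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        if len(prev_row) != len(curr_row):
            raise ValueError("frames must have the same width")
        # A pixel is foreground when its absolute change exceeds the threshold.
        mask.append([1 if abs(c - p) > threshold else 0
                     for p, c in zip(prev_row, curr_row)])
    return mask


def foreground_ratio(mask):
    """Fraction of pixels marked as foreground."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total if total else 0.0


# Two tiny synthetic frames: a bright object appears in the middle row.
background = [[10, 10, 10, 10],
              [10, 10, 10, 10],
              [10, 10, 10, 10]]
with_object = [[10, 10, 10, 10],
               [10, 200, 210, 10],
               [10, 10, 12, 10]]
mask = frame_difference(background, with_object)
```

Small intensity changes, such as the pixel that moves from 10 to 12, stay below the threshold and are treated as background noise; in practice the threshold has to be tuned to the camera and the lighting.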

Object tracking techniques are split into two main categories: 2D and 3D models.
The model-based approaches explicitly make use of a priori geometrical information
about moving objects, which in surveillance applications are usually pedestrians,
vehicles, or both. A number of approaches have been applied to classify newly detected
objects and to use the classification for object tracking.

At the stage of object tracking, filtering (Kalman filtering, Bayesian filtering,
particle filtering, etc.) is used to predict each position of the target object. Another
tracking approach uses connected components to segment the changes into different
objects without prior knowledge. This approach performs well when the object has a
very low resolution.
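
To make the prediction role of filtering concrete, the following is a minimal sketch of a one-dimensional constant-velocity Kalman filter. The class name, the noise parameters `q` and `r`, the initial covariance `p0`, and the simplified diagonal process noise are all assumptions made for illustration; a real tracker would work in two or three dimensions and tune these values.

```python
class ConstantVelocityKalman1D:
    """Kalman filter for the state [position, velocity] with position-only measurements."""

    def __init__(self, q=1e-3, r=1.0, p0=1000.0):
        self.pos, self.vel = 0.0, 0.0
        # 2x2 state covariance P, stored as four scalars.
        self.p00, self.p01, self.p10, self.p11 = p0, 0.0, 0.0, p0
        self.q, self.r = q, r  # process and measurement noise (illustrative values)

    def predict(self, dt=1.0):
        """Propagate x' = F x and P' = F P F^T + Q with F = [[1, dt], [0, 1]]."""
        self.pos += dt * self.vel
        a00 = self.p00 + dt * self.p10  # first row of F P
        a01 = self.p01 + dt * self.p11
        self.p00 = a00 + dt * a01 + self.q
        self.p01 = a01
        self.p10 = self.p10 + dt * self.p11
        self.p11 = self.p11 + self.q
        return self.pos

    def update(self, z):
        """Correct the prediction with a measured position z (H = [1, 0])."""
        innovation = z - self.pos
        s = self.p00 + self.r                 # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s   # Kalman gain
        self.pos += k0 * innovation
        self.vel += k1 * innovation
        # P = (I - K H) P
        p00, p01 = self.p00, self.p01
        self.p00 = (1 - k0) * p00
        self.p01 = (1 - k0) * p01
        self.p10 = self.p10 - k1 * p00
        self.p11 = self.p11 - k1 * p01


# A noise-free target moving 2 units per frame, observed for 20 frames.
kf = ConstantVelocityKalman1D()
for t in range(1, 21):
    kf.predict()
    kf.update(2.0 * t)
next_position = kf.predict()  # predicted position for frame 21
```

The predicted position can be used to narrow the search window in the next frame, which is what makes filtering useful when detections are noisy or briefly missing.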

Many scenes of high security interest, such as airports, train stations,
shopping malls, and street intersections, are usually very crowded. It is impossible
to track objects over a long distance without failures because of frequent occlusions
among objects in such scenes. Although a great number of existing surveillance systems
work well in sparse scenes, many challenges remain unsolved in real applications.

Intelligent surveillance benefits from the recognition and understanding of actions,
activities, and behaviors of the tracked objects. This stage corresponds to the
classification of the time-varying features that are acquired from the preceding stages.
Therefore, this stage consists of matching a picture sequence against a well-prepared,
labeled dataset that needs to be learned via a training process.

Search and retrieval is one of the key stages in intelligent surveillance and is
crucial for alarm making in particular. Relevant research has been conducted on how to
store and retrieve all the obtained surveillance events in an efficient manner,
especially over the cloud.

Alarm making is the ultimate goal of intelligent surveillance; the difficulty in
alarm making is to reduce false alarms, which easily annoy security staff and may cause
them to miss the true alarms. We therefore keep working on alarm-making approaches,
typically simple and complex alarming. Alarm making can be spatially based or
temporally based, but it should basically follow the process of decision making. Hence,
decision trees and random forests can help in decision making.
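
As a small illustration of combining a spatial test with a temporal test before raising an alarm, the sketch below hand-codes a two-level decision rule. It is not a learned decision tree or random forest; the zone rectangle, the dwell-time rule, and all names are hypothetical and chosen only to show how such a rule can suppress alarms for objects that merely pass through.

```python
def in_zone(point, zone):
    """Spatial test: is (x, y) inside the axis-aligned rectangle (x0, y0, x1, y1)?"""
    x, y = point
    x0, y0, x1, y1 = zone
    return x0 <= x <= x1 and y0 <= y <= y1


def should_alarm(track, zone, min_dwell):
    """Raise an alarm only if the object stays in the zone for `min_dwell` seconds.

    `track` is a list of (timestamp_seconds, (x, y)) observations sorted by time.
    """
    entered_at = None
    for timestamp, position in track:
        if in_zone(position, zone):
            if entered_at is None:
                entered_at = timestamp      # the object has just entered
            if timestamp - entered_at >= min_dwell:
                return True                 # spatial and temporal tests both passed
        else:
            entered_at = None               # the object left: reset the timer
    return False


restricted = (0, 0, 10, 10)
passer_by = [(0, (12, 5)), (1, (9, 5)), (2, (11, 5))]   # clips the zone briefly
loiterer = [(0, (5, 5)), (2, (6, 5)), (5, (6, 6))]      # stays for 5 seconds
alarm_for_passer_by = should_alarm(passer_by, restricted, min_dwell=3)
alarm_for_loiterer = should_alarm(loiterer, restricted, min_dwell=3)
```

Requiring both conditions is one simple way to trade a little detection delay for fewer false alarms.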
Fig. 1.2 Event-based surveillance system


An event-based surveillance system, shown in Fig. 1.2, was developed with the
following features: login, monitors, sensors, events, alarms, and logout.


1.1.1 Surveillance History 


Surveillance is deemed to have experienced three generations. The first generation
(1G) is analog closed-circuit television (CCTV). In this stage, the distribution and
storage of surveillance information rely on analog technologies, and
digital-to-analog (D/A) conversion plays the pivotal role. The second generation (2G)
is aided by computer vision technology [14]; typically, event detection [5], behavior
analysis, machine learning and deep learning, scene understanding, and semantic
interpretation have been adopted in surveillance. Semantic analysis is emphasized in
this generation, as shown in Fig. 1.3.

Fig. 1.3 Cameras mounted everywhere at Nelson Street Cycleway

1G systems, initially developed in the middle of the 1960s, are based on analog
signals; a CCTV system is a local television system in which TV scenes are transmitted
over a relatively short distance. Conventional CCTV cameras generally used a digital
charge-coupled device (CCD) to capture images. The captured images were converted into
an analog composite video signal, which was fed to CCTV monitors and recording
equipment with magnetic media, generally via coaxial cables. At this stage, image
distribution and storage are analog; D/A conversions have been adopted for digital
image and video processing.

Surveillance systems at this stage were operated manually to recognize, track, and
identify objects; different kinds of security infractions and violations were
detected by human observers staring at monitor screens. 1G surveillance systems have the
advantage of good visual performance. However, D/A conversion introduces delay and
image degradation, and the analog signals are sensitive to noise and easily disturbed
by strong electromagnetic sources. Furthermore, security staff are easily exhausted and
distracted in multiple-target tasks after several hours of gazing at cathode ray tube
(CRT) monitors. Therefore, 1G surveillance systems are not sufficient for automated
processing. The typical challenges in 1G systems include digital and analog signal
conversion, digital video recording and storage, artifact removal, and video
compression.

Technological improvement led to the development of 2G surveillance systems,
namely semi-automatic surveillance systems. Compared to 1G systems, 2G systems have
the merit of increased efficiency through computer vision-based visual content
analysis. The research problems of this generation usually lie in robust object
detection and object tracking. The current research focus of 2G surveillance systems is
on real-time live broadcasting and robust computer vision algorithms, machine learning
of scene variability and patterns of human behaviors, and bridging the gap between the
physical world and natural language interpretations. The most prevalent technologies at
present serve 2G surveillance systems.

3G surveillance systems concentrate on automated multimodal systems, intelligent
distribution systems (integration and communication), multisensor platforms, and data
fusion; probabilistic reasoning has been adopted in this phase. 3G systems cover wide
areas and provide more accurate information owing to the deployment of distributed
multisensors. On the other hand, existing problems of 3G systems include the lack of
information integration and communications. Current research problems of 3G
surveillance systems encompass distributed and centralized systems, data fusion,
probabilistic reasoning, and multicamera surveillance techniques [7,11,31].

In 3G systems, the difficulties we encounter are related to scene understanding,
such as indoor and outdoor scenes, crowds, scene geometry, and scene activity. The
relevant technologies include data fusion, ontology, and syntax and semantics. Among
them, secure communication between the various modules is critical. Therefore, we
need to understand the communication protocols, metadata, real-time systems, etc.

Our goals in intelligent surveillance are to save security staff’s time and labor
and to raise the right alarms, reducing false ones and ensuring that an alarm is
positive, correct, and effective. The reduction of human labor in intelligent
surveillance depends on the maximum area of the scanned region, the sensitivity of the
sensors, and whether the system can effectively and promptly extract and analyze the
data captured by cameras. False alarming is a prominent issue in surveillance which is
closely related to the robustness and reliability of a surveillance system. False
alarms not only waste security staff’s time in processing meaningless information but
also potentially allow a real suspect to slip through the monitored area without any
alert. Therefore, reducing false alarms is the paramount goal of all surveillance
systems; meanwhile, we should be aware that false alarms are hard to prevent because
the techniques of object detection, tracking, and recognition are at present still far
from meeting practical needs (Fig. 1.4).


Fig. 1.4 Hyperwall for the purpose of display


1.1.2 Open Problems in Intelligent Surveillance


Corresponding to the difficulties in intelligent surveillance, the open problems are
grouped into categories such as hardware, software, objects, events, and ethics. At
the hardware layer, typical problems are sensor deployment and management, data
capturing, compression, and secure transmission. At the object layer, the issues include
data semantic analysis and scene understanding, and object analysis such as fire and
smoking. At the event layer, the difficulties include event database creation and
exploration, event search and retrieval, mining and reasoning, and event exploration
and presentation using Web and mobile platforms [16]. At the ethics layer, we have to
face problems such as privacy preservation and protection in social networks,
supercomputing (FPGA/GPU), and cloud [39] and mobile computing.

The digital event is a key issue in intelligent surveillance since we need a good
story which bridges the gap between cyberspace and our real world. However, the
definitions of an event across various research fields are very diverse and only reflect
the content of designated applications. For example, in textual topic detection and
extraction, an event is something that happened somewhere at a certain time; in pattern
recognition, an event is defined as a pattern matched with a class of patterns; in signal
processing, an event is usually triggered by a state change.

In surveillance, an event consists of at least two aspects: spatial and temporal.
It can be either instantaneous or span a period of time. Moreover, an event is atomic
in accordance with how elementary events are defined within a surveillance system. It
is noteworthy that an event should be associated with a particular object that is
correlated with others to constitute a whole scene which is meaningful and worthwhile
to be studied.
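
One possible way to hold these aspects in a program is a small record type. The sketch below is an assumption-laden illustration: the class name, the field names, and the example labels are hypothetical, and a real event model would carry far richer semantics.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SurveillanceEvent:
    label: str                      # illustrative event label, e.g. "door_opened"
    object_id: int                  # the object the event is associated with
    location: Tuple[float, float]   # spatial aspect: image or ground-plane position
    start: float                    # temporal aspect: start time in seconds
    end: float                      # equal to `start` for an instantaneous event

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("an event cannot end before it starts")

    @property
    def duration(self) -> float:
        return self.end - self.start

    @property
    def is_instantaneous(self) -> bool:
        return self.duration == 0


door_opened = SurveillanceEvent("door_opened", 7, (3.0, 4.5), 12.0, 12.0)
loitering = SurveillanceEvent("loitering", 7, (3.2, 4.4), 12.0, 95.0)
```

Both events refer to the same object identifier, which is how elementary events can later be correlated into a larger scene.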

Data fusion is the process of integrating multiple data and knowledge representing
the same real-world object into a consistent, accurate, and useful representation.
Theoretically, data fusion in surveillance systems denotes the techniques and tools
which are used for combining sensor data, or data derived from a sensor array, into a
common representational format. In data fusion, the goal of intelligent surveillance is
to improve, filter, and refine the quality of surveillance information. A multisensor
data fusion system is a major component in fields dealing with pattern detection,
recognition, etc. [7]. Aggregating several sensors is always expected to achieve better
results.

Multisensor data fusion improves the performance of surveillance systems in four
ways, namely representation, certainty, accuracy, and completeness. Representation
denotes that the output information obtained during or after the process of data
fusion has richer semantic meaning than that of the individual input data. Certainty
refers to the probability of the data, which is expected to increase after fusion is
applied. Similarly, accuracy means that the standard deviation of the data after the
fusion process is smaller than that of the input data. Lastly, completeness means that
new information is added to the understanding of certain environments after fusion.
There are also four types of data fusion, namely fusion across sensors, fusion across
attributes, fusion across domains, and fusion across time.
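
The accuracy property can be illustrated with one common fusion rule, inverse-variance weighting. This rule is brought in here purely as an example and is not claimed to be the method used in this book; it assumes that the sensors measure the same quantity with independent, unbiased errors of known standard deviation. The function name and the sample readings are hypothetical.

```python
import math


def fuse(measurements):
    """Fuse independent (value, std_dev) pairs by inverse-variance weighting.

    Returns (fused_value, fused_std_dev). The fused standard deviation is
    sqrt(1 / sum(1 / std_i**2)), which is never larger than the smallest input.
    """
    if not measurements:
        raise ValueError("need at least one measurement")
    weights = [1.0 / (std ** 2) for _, std in measurements]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, measurements)) / total
    return value, math.sqrt(1.0 / total)


# Two sensors report the same distance with equal uncertainty (illustrative numbers).
fused_value, fused_std = fuse([(10.2, 0.5), (9.8, 0.5)])
```

With two equally reliable sensors the fused estimate is their average, and its standard deviation shrinks by a factor of the square root of two.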
To achieve scalability in a large surveillance system, a surveillance ontology
needs to be considered. Ontology is related to the nature of existence as well
as the basic categories of objects and their relations. Traditional ontology is listed
as a part of the major branch of philosophy known as metaphysics. Ontology deals
with questions concerning what entities exist and how such entities can be grouped
within a hierarchy and subdivided according to similarities and differences. In
surveillance, each camera or sensor is treated as an agent, and the ontology is adopted
to standardize and merge these types of information under specific environments.

The implementation of semantic translation in surveillance is closely related to 
ontology. Because various sources are used to detect events with different types of 
information [5], a formal event model with the ability to define all its elements with 
clear semantics is required not only to represent events themselves but also to make 
inferences for subsequent alarm making. 

Secure communication between the various modules in surveillance is also very
thorny. Therefore, we need to study the communication protocols, metadata, real-time
systems, etc. These include the distribution of processing tasks as well as the creation
of metadata standards or protocols to cope with current limitations in bandwidth
capacity.

Metadata is also called data about data, which means additional information about a
given set of data. Structural metadata is about the design and specification of data
structures and is more properly called “data about the containers of data.” On the other
hand, descriptive metadata is about individual instances of application data and the
data content. For instance, every pixel in a digital image is treated as the data of the
image itself, while the parameters of the camera, the resolution of the picture, and
its creation date and time are stored in the file header (e.g., EXIF) and are regarded
as the metadata of this image.

Big data usually includes datasets beyond the ability of commonly used software
tools to capture, curate, manage, and process within a tolerable elapsed time, i.e.,
with increasing volume (amount of data), velocity (speed of data in and out), and
variety (range of data types and sources). Big data uses inductive statistics and
nonlinear system identification (regressions, nonlinear relationships, and causal
effects) on large sets of data with low information density to reveal relationships
and dependencies and to perform predictions of outcomes and behaviors. Statistical
sampling enables the selection of the right data points within the dataset. While big
data continues to grow exponentially, visual surveillance has become the biggest data
source, since surveillance cameras capture a huge amount of video and images and feed
them into cyberspace daily [13].

In recent years, with the abuse of surveillance data generated as a result of the
frequent occurrence of security incidents, concerns such as human rights, ethics,
privacy [6], and authentication need to attract public attention. The abuse of
surveillance data is severely harmful to the whole society, as relevant private
information might be leaked through surveillance systems; once surveillance data are
intercepted, the public would experience a crisis of trust. Therefore, social,
ethical, organizational, and technical issues urgently require the right solutions,
typically for social networks, to protect individuals’ privacy [6].
Our computer systems are confronted with challenges in computing. The bottleneck
lies in CPU speed, which has not taken a great step forward in many years. Parallel
computing technologies, including multicore and multithread programming, have been
applied for acceleration in real applications. With hardware such as the graphics
processing unit (GPU) and the field-programmable gate array (FPGA), the computations
can be sped up.

In supercomputing, networking can join computing forces from all sides
together. Mobile, cloud, and social networking have played pivotal roles in modern
intelligent surveillance and will push surveillance events as notifications to
convenient devices in easy ways [39].

In this book, we will organize our content to explain these issues in different 
chapters: 


• Sensor deployment, calibration, and management
• Media capturing, compression, and secure transmission
• Media semantic analysis and understanding
• Object tracking and event detection
• Decision making and alarm making (e.g., fire, smoking, etc.)
• Event database and exploration
• Big data surveillance
• Discrete event computing
• Event presentation
• Ethics issues/privacy in surveillance
• Supercomputing (e.g., GPU, FPGA, etc.)
• Cloud/mobile computing
• Artificial neural networks and deep learning


1.2 Introduction to Sensor Deployment and Calibration 
1.2.1 Sensors 


Surveillance sensors are deployed everywhere and work under all weather conditions;
many of them have been connected as a network, and some of them have even been
connected to the Internet [15], as shown in Figs. 1.5 and 1.6. We group surveillance
sensors into the following categories:


• Visual sensors: these sensors include analog cameras, network cameras, IP cameras,
  Web cameras, wireless cameras, and infrared cameras. A recent camera, the Dynamic
  Vision Sensor (DVS), is also called a silicon retina and successfully simulates the
  human vision system by using neural networks [8]. These indispensable cameras can
  capture not only images but also video signals.
• Audio sensors: these are acoustic, sound, and vibration sensors for audio, sound,
  and voice recording; the recorded audio data also has metadata.
• Automotive sensors: these sensors are used in transportation and measure electric
  current, electric potential, magnetic, and radio signals. The sensors can reflect
  real-time changes in the situation while working without stopping.
• Environmental sensors: these scene sensors record data reflecting moisture,
  humidity, saturation, and temperature for environmental monitoring.
• Flow sensors: these fluid velocity-related meters are specially designed for flow
  detection, such as wind and water flow. Changes in the flow are revealed in the
  velocity.
• Navigation sensors: these sensors have absolute and relative positioning
  functionalities and are embedded as navigation instruments in devices such as mobile
  phones.
• Locating sensors: the typical sensor is the Global Positioning System (GPS); other
  sensors measure altitude, angle, displacement, distance, speed, and acceleration.
• Others: other sensors are specially designed for optical, force, density, pressure,
  and thermal measurement, etc.

Fig. 1.5 Sign of surveillance for the purpose of preventing crimes

Fig. 1.6 Satellite system for earth monitoring


For sensor deployment, communication, cooperation, control, and power management
are the fundamental issues in intelligent surveillance. Sensors are typically
connected together using topologies such as centralized, distributed, hybrid, and
multitier. Like computer networks, sensors are linked to a central server, to each
other, or in multiple tiers at various levels. In practice, hybrid and multitier
connections of miscellaneous sensors are at work in real surveillance systems.

A centralized network, shown in Fig. 1.7, is one in which all sensors are connected
to a central server for the purpose of data transmission; the server stores
both communications and user account information. Most public instant messaging
platforms adopt the framework of a centralized network.

A distributed network, shown in Fig. 1.8, is spread over heterogeneous networks.
Distributed networks do not have the central server that characterizes a centralized
network. Instead, all the server nodes are connected in a distributed manner.
This provides a direct, single data communication network.

Fig. 1.7 Centralized connection architecture of surveillance sensors

Apart from sharing communications within a network, a distributed network
often shares data transmission and processing. A hybrid network, shown in Fig. 1.9,
utilizes multiple communication standards or protocols simultaneously; that is,
the network is made up of equipment from multiple vendors. It can also be understood
as a network that mixes more than one networking technology. The concept of the
multitier sensor network has arisen as a counterpart to the concept of the single-tier
sensor network.

A single-tier sensor network is a flat network in which all sensor nodes sit at the
same level. In single-tier networks, a set of application-oriented requirements and
tasks is accomplished by an appropriate sensor and embedded platform which has the
capability to take charge of the entire network.

A multitier sensor network, shown in Fig. 1.10, is a hierarchical network of sensors
which are connected in a tree-like shape. With several levels of reliability and
expenditure, depending on the type of sensor adopted for the application tasks,
multitier networks flexibly organize the construction of the whole sensor network
according to different goals and requirements. Centralized, distributed, hybrid, and
multitier sensor networks are shown in the figures of this section.
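
As a rough illustration of how the centralized and multitier connection models differ, the sketch below builds the link sets of a star and of a tree. The node names, the `server` root, and the round-robin assignment of children to parents are arbitrary assumptions made for this example only.

```python
def centralized(sensors, server="server"):
    """Star topology: every sensor links directly to one central server."""
    return {(sensor, server) for sensor in sensors}


def multitier(tiers, root="server"):
    """Tree topology: each node in a tier links to a parent in the tier above.

    `tiers` is a list of lists of node names, top tier first. Children are
    assigned to parents round-robin, an arbitrary choice for illustration.
    """
    links = set()
    parents = [root]
    for tier in tiers:
        if not tier:
            continue  # skip empty tiers so that a parent always exists
        for i, node in enumerate(tier):
            links.add((node, parents[i % len(parents)]))
        parents = tier
    return links


star = centralized(["cam1", "cam2", "cam3"])
tree = multitier([["gw1", "gw2"], ["cam1", "cam2", "cam3"]])
```

In the star, the server handles every camera directly; in the tree, the intermediate nodes (here two hypothetical gateways) can aggregate or pre-process data before it reaches the server.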

The deployment of sensors has a great influence on the performance and cost
of surveillance systems. Redundant sensors increase processing time and installation
costs. On the other hand, a lack of sensors may cause blind regions which reduce
the reliability of the monitoring system. Thus, it is advisable to simulate the
deployment of sensors beforehand so that the configuration covers the entire region
with the minimum number of sensors and the minimum cost [15].
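
Simulating a deployment beforehand can be as simple as testing candidate sites against a grid of cells that must be covered. The sketch below uses a greedy set-cover heuristic, which is an assumption of this example rather than the method of [15]; it yields a small, though not necessarily minimal, set of sites. The site names and coverage sets are made up.

```python
def greedy_deployment(region_cells, candidate_coverage):
    """Pick candidate sensor sites until every cell of the region is covered.

    `region_cells` is a set of cell identifiers; `candidate_coverage` maps a
    candidate site name to the set of cells a sensor placed there would see.
    Greedy set cover gives an approximate, not necessarily minimal, answer.
    """
    uncovered = set(region_cells)
    chosen = []
    while uncovered:
        # Choose the site that covers the most still-uncovered cells.
        best = max(candidate_coverage,
                   key=lambda site: len(candidate_coverage[site] & uncovered),
                   default=None)
        gained = candidate_coverage[best] & uncovered if best is not None else set()
        if not gained:
            raise ValueError(f"blind region remains: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gained
    return chosen


cells = set(range(6))
sites = {"A": {0, 1, 2}, "B": {2, 3}, "C": {3, 4, 5}, "D": {1, 2, 3}}
chosen_sites = greedy_deployment(cells, sites)
```

Here two of the four candidate sites are enough to cover all six cells, so the other two would be redundant sensors in the sense described above.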

Fig. 1.8 Distributed connection architecture of surveillance sensors

Fig. 1.9 Hybrid connection architecture of surveillance sensors

Fig. 1.10 Multitier connection architecture of surveillance sensors

Fig. 1.11 Multicamera calibrations

An example of a more advanced technique, which involves intelligent object
detection and camera self-calibration [4,20,21,41], is illustrated in Fig. 1.11. The
cooperation between these cameras is implemented through event-driven camera
communications; these cameras are able to exchange messages and decide automatically
what measures should be taken to deal with ongoing events. Automatic camera
calibration and management methods which address camera network calibration are
under development. The objective is to remove, or at least reduce, the amount of
manual input required.


1.2.2 Surveillance Cameras 


Surveillance cameras can be integrated into devices such as smartphones and cars,
which can be connected to central servers in a wired or wireless way. There is
a broad spectrum of lenses for surveillance cameras, such as wide-angle and fish-eye
lenses. Therefore, a vast amount of geometric distortion may be generated by the
various cameras; the principle of a camera is shown in Fig. 1.12. Distortion removal
is a preprocessing step in visual surveillance and is also one of the primary
tasks in digital image processing.

There are two basic purposes for deploying surveillance cameras, namely investigation
and deterrence. The data recorded by surveillance cameras will most often
be used to review a crime or an unexpected accident or incident, so that a
channel is provided for security staff to understand and trace back what really
happened there. However, the cameras themselves also have a deterrent value because
people who know they are being watched are usually on their best behavior.
In order to maximize the investigative and deterrent value of the cameras,
it is necessary to carefully choose where we place these cameras and in which
direction we point them. Generally, the four locations where security cameras are
often installed are entrances and exits, customers’ transaction points,
targets, and secluded areas.

Fig. 1.12 Nature of a pinhole camera

Fig. 1.13 Bus surveillance system with the deployment of multiple cameras

We provide a camera system installed on a school bus as the example shown in
Fig. 1.13. In this bus, two cameras point at the front door and the back door. Inside
the bus, multiple cameras are mounted and oriented toward passengers; a screen is fixed
to show the pictures sourced from all cameras. Outside the bus, multiple cameras
monitor the surroundings, including the front, the back, and the lanes beside the bus,
providing road monitoring services.

Determining a digital camera’s field of view (FoV) is complicated; the exact
view varies with the focal length of each camera. As a rough guide, a standard
camera placed at a far corner of a 15 m × 15 m room will still capture most of the
scene. In general, the smaller the lens and the shorter the focal length, the wider
the FoV. Cameras with a longer focal length and a larger lens often capture images
at a long distance better than those without. Many of today’s cameras can adjust
the lens automatically to meet the needs of their users. This “varifocal” ability is
a great feature when it is unclear in advance which focal length is required. Digital
camera lenses with a fixed focal length are less expensive and generate less
distortion.
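
The inverse relation between focal length and FoV can be made concrete with the
thin-lens angle-of-view formula FoV = 2·arctan(d / 2f), where d is the sensor width
and f the focal length. The sketch below is illustrative only: the formula assumes
an ideal lens focused at infinity, and the 4.8 mm sensor width and the listed focal
lengths are assumed values, not taken from the text.

```python
import math


def field_of_view_deg(sensor_width_mm: float, focal_length_mm: float) -> float:
    """Horizontal angle of view in degrees for an ideal lens focused at infinity."""
    if sensor_width_mm <= 0 or focal_length_mm <= 0:
        raise ValueError("sensor width and focal length must be positive")
    return math.degrees(2.0 * math.atan(sensor_width_mm / (2.0 * focal_length_mm)))


if __name__ == "__main__":
    SENSOR_WIDTH_MM = 4.8  # assumed sensor width, for illustration only
    for focal_length in (2.8, 4.0, 8.0, 12.0):
        fov = field_of_view_deg(SENSOR_WIDTH_MM, focal_length)
        print(f"f = {focal_length:4.1f} mm -> FoV = {fov:5.1f} degrees")
```

Under these assumptions, doubling the focal length roughly halves the FoV once the
angle is small, which is why long lenses suit distant targets.
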

The view of a single camera is finite and limited by scene structures. To monitor a
wide region, such as tracking a vehicle through the road network of a city or
analyzing the global activities in a large train station, videos taken
simultaneously from multiple cameras have to be used. Wide-area surveillance control
relies on large-scale data analysis and management to cover a broad territory in
intelligent visual surveillance. It is a multidisciplinary field related to computer
vision, pattern recognition, signal processing, communication, multimedia
technology, etc., and it is closely related to multisensor control and cooperative
camera techniques [7].

There are often large changes in viewpoint, illumination conditions, and camera
settings between the views of the field, so matching the appearance of objects across
camera views is difficult. The topology of an extensive camera network can be
complicated, and the scene layout limits the FoV of the cameras. Some camera views
are disjoint and may cover multiple ground planes. These factors bring great
challenges for camera calibration, inference, and topology linking.

In order to monitor a wide area with a small number of cameras and to acquire
high-resolution images from optimal views, a number of surveillance systems employ
both static cameras and active cameras, whose pan, tilt, and zoom (PTZ) parameters
are automatically and dynamically controlled [19]. Calibration, motion detection,
object tracking, and behavior analysis with hybrid cameras face new challenges
compared with using static cameras only.


1.2.3 Camera Calibration 


Camera calibration originally refers to the correction of image distortions, the
phenomenon in which lines of the image appear curved owing to surface irregularities
of the camera lens when images are generated from sensors [40]. Geometric distortion
is separated into two categories: internal distortion and external distortion.
Internal distortion [29] results from the geometry of the lens (such as radial
distortion [38], projection distortion, and stepwise distortion); external
distortion is owing to the shape distortion of the object. An example is shown in
Fig. 1.14.

Fig. 1.14 Example of image correction using camera calibration (panels: scene,
image, corrected image)

Image distortion detection is an essential problem in digital image processing. A
vast number of techniques have been applied to image distortion detection and
correction. The scale-invariant feature transform (SIFT) has been employed for
detection; meanwhile, the theory of Kolmogorov complexity has been applied to the
normalized information distance [17]. According to the general affine
transformation, we have


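$$
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= R \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + t \tag{1.1}
$$

where

$$
R = \begin{bmatrix}
r_{11} & r_{12} & r_{13} \\
r_{21} & r_{22} & r_{23} \\
r_{31} & r_{32} & r_{33}
\end{bmatrix} \tag{1.2}
$$

$$
t = \begin{bmatrix} t_1 \\ t_2 \\ t_3 \end{bmatrix} \tag{1.3}
$$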


Furthermore, we denote the homogeneous matrix [R | t] as

$$
[R \mid t] = \begin{bmatrix}
r_{11} & r_{12} & r_{13} & t_1 \\
r_{21} & r_{22} & r_{23} & t_2 \\
r_{31} & r_{32} & r_{33} & t_3
\end{bmatrix} \tag{1.4}
$$

Mathematically, a perspective projection for simulating a camera projection makes
use of Eq. (1.5) below:

Therefore, we simplify the equation as

$$
s \cdot m = A \cdot [R \mid t] \cdot M \tag{1.7}
$$

where (X, Y, Z) are the coordinates of a 3D point in the world coordinate system,
(u, v) are the coordinates of the projection point in pixels, A is called the camera
matrix or the matrix of intrinsic parameters, (c_x, c_y) is the principal point,
which is usually at the image center, (f_x, f_y) are the focal lengths, and the joint
rotation-translation matrix [R | t] is called the matrix of extrinsic parameters [3].
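
As a hedged illustration of Eq. (1.7), the sketch below projects one 3D point to
pixel coordinates. It assumes the usual pinhole conventions, which are not spelled
out in the text above: m = (u, v, 1)^T, M = (X, Y, Z, 1)^T, the scale s equals the
depth of the point in camera coordinates, and A has zero skew. The function name
`project_point` and all numeric values are hypothetical.

```python
def project_point(point, R, t, fx, fy, cx, cy):
    """Project a 3D world point (X, Y, Z) to pixel coordinates (u, v).

    Assumes a zero-skew intrinsic matrix A = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    """
    # Extrinsic step, Eq. (1.1): camera coordinates = R * point + t
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point lies behind the camera")
    # Intrinsic step: s * (u, v, 1)^T = A * xc, with s equal to the depth xc[2]
    u = fx * xc[0] / xc[2] + cx
    v = fy * xc[1] / xc[2] + cy
    return u, v


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

if __name__ == "__main__":
    # Camera at the world origin looking along +Z; all values are made up.
    u, v = project_point((0.5, -0.25, 5.0), IDENTITY, (0.0, 0.0, 0.0),
                         fx=800.0, fy=800.0, cx=320.0, cy=240.0)
    print(u, v)
```

A point on the optical axis maps to the principal point (c_x, c_y), which is a quick
sanity check for any implementation of this model.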

Along with camera modeling, the algorithm of image correction or camera calibration
is divided into the following steps [40]:


• Corner extraction: corners are the salient points in an image. A corner usually
refers to the intersection of two or more edges. Computationally, corner points
are associated with the Hessian matrix having the maximum eigenvalues [17].

• Point ordering: the detected corners are sorted and ordered. This step prepares
for the matching of two similar images.

• Point correspondences: the ordered corners are matched with their corresponding
points between two similar regions, manually or automatically. Usually, a finite
number of corners is selected for matching at the initial stage.

• Bundle adjustment: once a number of corners have been assigned to their
corresponding matches, bundle adjustment automatically corrects the remaining
points to the right positions so as to complete the calibration.


For a single camera, the calibration is based on a limited number of points in 3D
space whose coordinates correspond to pixels of an image. From these limited 3D
points [27], all pixels of the image corresponding to points in the 3D space are
adjusted; this is called bundle adjustment.

For multicamera self-calibration [10,22,25], we first need to set up a scene and use
a group of cameras to capture images of the fixed scene. From the captured images and
the constraints between the cameras and the objects in this scene, we seek the
corresponding coordinates of the objects and cameras in the 3D Cartesian system. The
method of least squares, a standard approach in regression analysis, is applied to
find an approximate solution of the overdetermined system. If any cameras or objects
in the scene have been slightly changed, the multicamera self-calibration system is
able to find the extrinsic and intrinsic parameters again by using the method of
least squares [1,23,33].
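
To show what solving an overdetermined system by least squares looks like in the
simplest case, the sketch below fits the linear model u = a·x + b to more
observations than unknowns via the normal equations. Reading a as a focal length f_x
and b as a principal-point coordinate c_x (with x a normalized image coordinate) is
only an illustrative assumption; a real self-calibration system solves a much larger
nonlinear problem. The function name and the sample data are hypothetical.

```python
def fit_line_least_squares(xs, us):
    """Return (a, b) minimizing sum((a * x + b - u) ** 2) over all samples."""
    n = len(xs)
    if n < 2 or n != len(us):
        raise ValueError("need at least two samples and equal-length inputs")
    sx, su = sum(xs), sum(us)
    sxx = sum(x * x for x in xs)
    sxu = sum(x * u for x, u in zip(xs, us))
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("degenerate data: all x values are identical")
    a = (n * sxu - sx * su) / det
    b = (su - a * sx) / n
    return a, b


if __name__ == "__main__":
    # Five made-up observations, slightly perturbed around u = 800 * x + 320.
    xs = [-0.2, -0.1, 0.0, 0.1, 0.2]
    us = [160.3, 239.8, 320.1, 400.2, 479.6]
    print(fit_line_least_squares(xs, us))
```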

Camera calibration is a fundamental problem in computer vision and is indispensable
in visual surveillance. There is a large body of literature on calibrating cameras
with respect to a 3D world coordinate system. Both the intrinsic parameters (such as
focal length, principal point, skew coefficients, and distortion coefficients) and
the extrinsic parameters (such as the position of the camera center and the camera’s
orientation in world coordinates) of the cameras need to be estimated. The process of
finding the intrinsic and extrinsic parameters is time-consuming, especially when
the number of cameras is large.

Estimating the intrinsic parameters is a prerequisite to a wide variety of machine
vision tasks related to motion and stereo analysis [35]. The parameters involved
include aspect ratio, shift, rotation, zoom [2], skew, focal length, location, and so
on. Without any information regarding the intrinsic parameters, a sequence of images
can yield 3D structure only up to an arbitrary projectivity of 3D space. In order to
upgrade this projective reconstruction to a Euclidean one [34], the intrinsic
calibration parameters of the camera are crucial.

Camera calibration is essential for multiple camera systems. Fusing and transferring
information between views depends on the relationships among the cameras within the
setting. Manual camera calibration is a time-consuming task: whenever a camera is
displaced, removed, or added, it needs to be recalibrated. In a synchronized network
with multiple cameras, this becomes a significant hurdle to deploying surveillance
systems automatically. A system should therefore aim to detect and track changes and
update the calibration information in a timely manner. If the views of two cameras
overlap substantially, a homography between them can be computed at calibration
time. A number of approaches automatically select and match static features from 2D
images to compute an assumed homography between two camera views and align multiple
camera views to a single rectified plane. The chosen features are typically corner
points, such as Harris corners and scale-invariant feature transform (SIFT) points.
The matching needs to be robust to variations of viewpoint. Relevant topics include:


• Projective geometry: as shown in Fig. 1.11, projective geometry is a branch of
geometry dealing with the properties and invariants of geometric figures under
projection. It describes an object as it appears to be, and it studies geometric
quantities, including lengths, angles, and parallelism, which become distorted
when we look at specified objects. Further, projective geometry is also a
mathematical model of how images of the 3D world are formed.

• Kruppa equations: the Kruppa equations [18,36] map the estimated projection
matrices to Euclidean ones. By tracking a set of points among images of a rigid
scene, captured by a moving camera with constant intrinsic calibration parameters,
the latter can be estimated by determining the image of the absolute conic, a
special conic lying on the plane at infinity with the property that its image
projection depends on the intrinsic parameters only [12,26]. This fact is
expressed mathematically by the Kruppa equations.

The matrix

$$
K = \begin{bmatrix}
k_1 & k_2 & k_3 \\
k_2 & k_4 & k_5 \\
k_3 & k_5 & 1
\end{bmatrix}
$$

is known as the Kruppa coefficient matrix. By using the method of Cholesky
factorization, one can obtain K = D U V*, where D is a lower triangular matrix
with real and positive diagonal entries, U is a diagonal matrix, and V* denotes
the conjugate transpose of V.

• Motion constraints: the purpose is to generate a specific instance of a motion
through a low-dimensional task [30]. The main idea is to create a low-dimensional
motion constraint from a sparse set of captured motions. Related subjects are
translation, rotation, shake, pan-tilt, radial lens distortion, critical motion
sequences, and so on. The relevant motion constraint equation is derived and used
to calculate optical flow with additional constraints [32].

• Scene constraints: typical scene constraints include vanishing points, orthogonal
directions, rectified planes, and stereo-vision systems [17,37].

• Calibration techniques: apart from the issues in self-calibration, techniques
[24,28] such as bundle adjustment, nonlinear minimization, and online error
estimation also need attention.


1.3 Questions 


Question 1. Why should we study intelligent surveillance? What are the advantages
of a digital surveillance system? What is a CCTV system?

Question 2. What are the intrinsic and extrinsic parameters of a camera? 

Question 3. When should we conduct camera calibration? What is multicamera self- 
calibration? 

Question 4. What is the relationship between the field of view (FoV) and the focal
length of a camera?

Question 5. What is the pipeline of surveillance systems? 

Surveillance Data Capturing and Compression


2.1 Data Capturing Using FSM 


The theory of the finite state machine (FSM) originated in mathematics. In an FSM,
the next state is uniquely determined by the current state and a single input. FSMs
have been applied wherever a sequence of operations needs to be recognized or
controlled [16]. Most programmable devices contain controllers that are designed
using FSMs [6,7,9,23,25]. FSMs can be found in almost all electronic products at
home, in the office, or in a car, from consumer electronics such as cellular
telephones and CD players [8] to dishwashers and washing machines.

The concept of a state diagram is commonly used in software engineering [13]. The
state diagram of the unified modeling language (UML) is a vital component used to
describe an FSM as a chart of the life-span of an object. The life-span refers to the
period between the start and the end of the object; the diagram is a special chart
for content description. It has at least one start state and one finish state.
Connections between the states represent the transitions that occur if their
conditions are met. Various state transitions may take place between the start state
and the finish state. Furthermore, a state diagram also includes events, guard
conditions, activities, and actions.

A state is an abstraction of the attribute values and links of an object. The object
described by a state can be either a physical or a semantic object. When the target
object is a suspicious person in surveillance, the state of the object can be a human
motion such as walking, sitting, standing, or sleeping. In a typical FSM, various
states exist during the life-span of a state diagram, such as alive, asleep, and
await, which refer to a state that is activated, closed, and ready to be processed,
respectively.

An event refers to something that happens or starts at a point in time, as defined
by the US National Institute of Standards and Technology (NIST) [4]. It only tells
us when a transition starts; it does not describe the whole process or parts of the
life-span. An event in an FSM is usually used to illustrate the action that happens
before the transition between two states.

An activity is an operation that takes a period of time to be processed. During the
processing time, an event is linked to its states.

An action is an operation performed before or after a state changes. Conditions
normally trigger actions, and an action occurs on entering the new state. Each
action represents a change in the process, which is either programmed (e.g., starting
a timer or increasing a counter) or hardware based (e.g., actuators). There are
normally two types of actions, which are used to depict the different status of a
state.

The state diagram derives from software engineering. Each state in a state diagram
has its own activity. The diagram helps us to understand a problem in software
engineering, and it also helps in design, implementation, and testing, especially
for task analysis and task refactoring. In software engineering, a state diagram is
associated with structural diagrams and behavioral diagrams. Structural diagrams
include the class diagram, object diagram, component diagram, and deployment
diagram, while behavioral diagrams include the use case diagram, sequence diagram,
collaboration diagram, activity diagram, etc. [21].

An FSM is a model of behavior composed of a finite number of internal states (or
simply states), the transitions between these states, the set of events that occur,
and the actions performed. The internal state describes the status of the machine
based on past events or actual inputs, thus reflecting the history of what has
happened since the system started up. Between two states, there must be activities
or actions; an FSM diagram is therefore different from a flowchart in programming.

A flowchart includes start and end points drawn as ellipses; between them, a diamond
is used for condition selection (if <condition>, then <structure>), and a rectangle
refers to an operation. A flowchart follows a logical order, whereas an FSM is an
abstract model of a machine with a primitive internal memory [12]; thus, an FSM must
run with a memory and is closely related to computer memory.

The FSM shown in Fig. 2.1 can be unfolded into a time series, like a hidden Markov
model (HMM) or a recurrent neural network (RNN) [11,15], as shown in Fig. 2.2.

Fig. 2.1 States and transitions of the FSM of a door. Door state 1: open; door state 2: closed

Fig. 2.2 Unfolded FSM of a door with time t1 and t2 as well as the states s1 and s2


Table 2.1 State transition table

Inputs      State s1    State s2    State ...
Input i1
Input i2    <action>
Input i3


An FSM definition including the full actions is presented in Fig. 2.2.
Correspondingly, we show the state transitions of this FSM in Table 2.1; this table
is called a state transition table.


Given a sequence of inputs I = {i1, i2, ..., in}, the FSM generates a sequence of
outputs O = {o1, o2, ..., om} which depend on the initial state s1 ∈ S,
S = {s1, s2, ..., sK}. A transition function T(S, I) maps each current state and
input to the next state, and an output function O(S) maps each current state to an
output; S is a finite and nonempty set of states.

A deterministic FSM [4] is a quintuple (I, S, s1, T, F): I is the input alphabet
(a nonempty set of symbols); S is the finite, nonempty set of states; s1 is the
initial state, an element of S; T is the state transition function; and F is the set
of final states, a subset of S, F ⊆ S.

Two states s1 and s2 in an FSM are equivalent if and only if they generate identical
output sequences for every input sequence. Two FSMs are equivalent if and only if
(iff) every input sequence produces identical output sequences in both.

Given an FSM, we hope to find the simplest equivalent FSM with the minimum number of
states. Using the equivalence between states, we first mark each pair of states
(s1, s2) as nonequivalent if the two states have different outputs. Then, for each
input i, we examine the successor pair (T(s1, i), T(s2, i)) and mark (s1, s2) as
nonequivalent if the successor pair is already marked; we repeat this marking
iteratively until no more marking is possible. At that point, the unmarked state
pairs are all equivalent and can be merged. This procedure is called FSM
minimization. Accordingly, the FSM has been simplified.

Fig. 2.3 A surveillance state diagram
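
The marking procedure can be sketched as follows. This is a minimal version that
assumes every state carries exactly one output (a Moore-style machine) and that the
transition function is total; the function name `equivalent_state_pairs` and the
three-state example machine are hypothetical.

```python
from itertools import combinations


def equivalent_state_pairs(states, inputs, transition, output):
    """Return the set of unordered state pairs that remain unmarked (equivalent).

    transition maps (state, input) -> next state; output maps state -> output.
    """
    pairs = [frozenset(p) for p in combinations(states, 2)]
    # Step 1: mark every pair whose two states have different outputs.
    marked = {p for p in pairs if len({output[s] for s in p}) > 1}
    # Step 2: mark a pair when some input leads to an already marked pair;
    # repeat until no more marking is possible.
    changed = True
    while changed:
        changed = False
        for p in pairs:
            if p in marked:
                continue
            a, b = tuple(p)
            for i in inputs:
                successors = frozenset((transition[(a, i)], transition[(b, i)]))
                if len(successors) == 2 and successors in marked:
                    marked.add(p)
                    changed = True
                    break
    return {p for p in pairs if p not in marked}


# Hypothetical machine: A and B behave identically, C differs in its output.
STATES = ["A", "B", "C"]
INPUTS = [0, 1]
TRANSITION = {("A", 0): "B", ("A", 1): "C",
              ("B", 0): "A", ("B", 1): "C",
              ("C", 0): "C", ("C", 1): "C"}
OUTPUT = {"A": 0, "B": 0, "C": 1}

if __name__ == "__main__":
    print(equivalent_state_pairs(STATES, INPUTS, TRANSITION, OUTPUT))
```

In this example only the pair (A, B) stays unmarked, so the two states can be merged
and the machine shrinks from three states to two.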

A typical example of an FSM is the door shown in Fig. 2.1. The door has two states,
namely open and closed. The door switches from one state to the other when somebody
opens or closes it. In this case, the object is the door, which is to be opened or
closed. The two states used to describe the status of the door are opened and
closed. The triggering condition that decides the state of the door is that someone
opens or closes it. The action is that the person starts opening the door or
finishes closing it.
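
The door example can be written down directly as a transition table. The sketch
below is one possible encoding; the input names `open_door` and `close_door` and the
choice to ignore inputs that do not apply in the current state are assumptions made
for illustration.

```python
from enum import Enum


class DoorState(Enum):
    OPEN = "open"
    CLOSED = "closed"


# Transition function T(state, input) -> next state.
TRANSITIONS = {
    (DoorState.CLOSED, "open_door"): DoorState.OPEN,
    (DoorState.OPEN, "close_door"): DoorState.CLOSED,
}


def step(state, event):
    """Return the next state; inputs without a transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(initial, events):
    """Feed a sequence of inputs to the FSM and return the visited states."""
    state = initial
    trace = [state]
    for event in events:
        state = step(state, event)
        trace.append(state)
    return trace


if __name__ == "__main__":
    for s in run(DoorState.CLOSED, ["open_door", "open_door", "close_door"]):
        print(s.value)
```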

For example, we have a surveillance state diagram shown in Fig. 2.3. This state
diagram tells the story of what happens in a room and shows the events: a person
coming in, sitting down, standing up, drawing, and walking out. After we work out
the states, we can start writing a program that is able to detect the events [13,22].
Specifically, the events described in the diagram include:


• Walk in and sit down event (a)

• Stand up and walk out event (b)

• Walk in, sit down, stand up, and walk out event (c)

• Draw on walls event (d).

A surveillance event is a semantic unit of sensor data. When surveillance data is
being captured, we should think about how to depict semantic data in surveillance;
we use the concept “event” as the semantic unit. Surveillance events bridge the gap
between our physical world and the semantic cyberspace. A semantic concept refers
to an object description with meaning, such as “tree,” “apple,” “sitting,” and
“walking” [3].

In surveillance event computing, we need to investigate the typical “5W” questions:
who, when, where, what, and why. These are just like the fundamental elements of a
piece of news in a daily newspaper. In surveillance, if we need to present a story
or an event, we must find out and define the content of the “5W” related to the
event. Therefore, we need to answer the following questions in surveillance.


Who participated in the event? 
What is the event? 

Where did the event take place? 
When did the event take place? 
Why did the event happen? 


In surveillance, “who” is related to the main object or role; biometrics provides
assistance in solving this problem, for example through face, fingerprint, iris,
gait, or pedestrian recognition, or plate recognition of cars, buses, and trucks.
“When” refers to date and time, namely time stamps; “where” is the location
information from GPS, WiFi, MAC address, or locating software. If we combine “when”
and “where,” we formally call it a spatial-temporal relationship. “What” means the
content description and usually relates to ontology; meanwhile, “why” refers to the
reasons obtained from reasoning and inference.
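
One simple way to hold the five answers together is a small record type. The sketch
below is only an illustration of the idea; the class name `SurveillanceEvent`, its
field types, and the sample values are hypothetical and not part of any standard.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class SurveillanceEvent:
    who: str                    # main object or role, e.g. from biometrics
    what: str                   # content description of the event
    where: str                  # location, e.g. from GPS or a camera identifier
    when: datetime              # time stamp
    why: Optional[str] = None   # reason, often filled in later by inference

    def describe(self) -> str:
        reason = self.why if self.why is not None else "unknown"
        return " | ".join(
            [self.who, self.what, self.where, self.when.isoformat(), reason]
        )


if __name__ == "__main__":
    event = SurveillanceEvent(
        who="person #17",
        what="walk in and sit down",
        where="Room 101",
        when=datetime(2019, 3, 1, 9, 30),
    )
    print(event.describe())
```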

A typical example of a state diagram in surveillance is one that controls the
function of audio capturing. Within this FSM, there are only two states, namely
recording and sleeping (Fig. 2.4). The events that may occur in this FSM are
activating an alarm, which switches the state from sleeping to recording, and
deactivating the alarm, which switches the state from recording back to sleeping.

The program starts by cleaning the buffer and flags; four conditions are then taken
into consideration for audio capturing. If an audio clip is captured successfully,
the waveform is played back.

The pseudocode below shows the FSM implementation, which was written in the script
language of MATLAB. MATLAB is a powerful research tool that has been successfully
employed in scientific simulations. MATLAB is able to run scripts recursively and in
parallel.


Fig. 2.4 A surveillance state diagram for audio capturing

Algorithm 1, based on MATLAB, is used to record sounds. When we are talking, the
program starts recording our voice; when we stop, the computer writes the voice data
into a wave file and then plays the audio waveform back. While we keep talking, the
data continues to be stored on the running computer, and what we said is saved as a
“.wav” file. The program also has the potential to broadcast the audio footage
through the Internet, as Skype does: once an HTTP server such as Apache is installed
on the computer and linked to the voice file with a streaming player, the audio is
broadcast.


Input: Sound signals;
Output: Audio files;
Initialization;
Clean buffer and flags;
while <True> do
    if <no sound captured> && <the buffer is empty> then
        capture audio;
    end
    if <no sound captured> && <the buffer is not empty> then
        write the buffer out;
        clean buffer and flags;
        play the audio;
    end
    if <sound captured> && <the buffer is empty> then
        add data to the buffer;
        set flag;
        start recording;
    end
    if <sound captured> && <the buffer is not empty> then
        add data to the buffer;
        continue recording;
    end
end

Algorithm 1: A surveillance algorithm for audio capturing
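
The four conditions of Algorithm 1 can be exercised without audio hardware. The
sketch below is a Python illustration, not the MATLAB program itself: it replaces
the microphone with a list of sample amplitudes, treats a sample as sound when its
magnitude reaches an assumed threshold, and returns the recorded clips instead of
writing and playing a file. The function name `capture_clips` is hypothetical.

```python
def capture_clips(samples, threshold):
    """Group consecutive loud samples into clips with the sleeping/recording FSM."""
    clips = []
    buffer = []  # an empty buffer corresponds to the state "sleeping"
    for sample in samples:
        sound_captured = abs(sample) >= threshold
        if sound_captured:
            # Buffer empty: start recording. Buffer not empty: continue recording.
            buffer.append(sample)
        elif buffer:
            # No sound but data in the buffer: write the buffer out and clean it.
            clips.append(buffer)
            buffer = []
        # No sound and an empty buffer: stay asleep and keep listening.
    if buffer:
        # The input ended while recording, so flush the last clip.
        clips.append(buffer)
    return clips


if __name__ == "__main__":
    print(capture_clips([0, 0, 5, 7, 0, 0, 6, 0], threshold=3))
```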


In a general surveillance capturing system, the flags, the states sleeping and
recording, and the transitions between these states are shown in Fig. 2.5. Different
transitions lead to distinct FSM actions.

The difficulty in using an FSM is how to find the threshold for audio and video
signals. The typical way is to collect data as a training dataset and seek the
interval of the threshold value from repeated observations; an ideal threshold is
then fixed through tests. The feedback and evaluations based on the given threshold
should be taken into consideration in further study.
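
One plausible way to turn such observations into a threshold is to try every observed
level as a candidate and keep the one that misclassifies the fewest training
samples. The sketch below assumes labelled training data (signal levels recorded
during silence and during sound); this selection rule, the function name
`choose_threshold`, and the sample values are illustrative assumptions rather than
the procedure used in the text.

```python
def choose_threshold(silence_levels, sound_levels):
    """Return (threshold, errors) with the fewest errors on labelled training data.

    A level is classified as sound when it is greater than or equal to the threshold.
    """
    candidates = sorted(set(silence_levels) | set(sound_levels))
    best_threshold, best_errors = None, None
    for t in candidates:
        false_alarms = sum(1 for v in silence_levels if v >= t)
        misses = sum(1 for v in sound_levels if v < t)
        errors = false_alarms + misses
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


if __name__ == "__main__":
    silence = [0.1, 0.2, 0.3, 0.9]   # made-up levels observed without sound
    sound = [0.8, 1.2, 1.5, 2.0]     # made-up levels observed with sound
    print(choose_threshold(silence, sound))
```

The chosen value should still be checked on data that was not used for the search,
in line with the testing and feedback described above.
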

Fig. 2.5 Transitions between surveillance states


2.2 Data Capturing Using DNN 


In recent years, deep neural networks (DNNs) have achieved remarkable progress in
solving many complex problems [26]. DNNs are suitable for dealing with problems
related to time series, such as speech recognition [17] and natural language
processing. Video dynamics detection, as an instance, is time dependent: it needs to
utilize the present, previous, and next frames of a given video. If a frame change
occurs, it triggers a decision on whether a video event has happened or not.

Moving object detection is the cornerstone of object monitoring and tracking. The
fast and accurate detection of moving objects has become a hot topic in the field of
video surveillance. With the existing deep learning models, we can now achieve
high-precision and real-time video dynamics detection compared with the traditional
methods.

The gated recurrent unit (GRU) is used to construct the recurrent neural network
(RNN); GRU cells are employed to declare a GRU unit. The output of the convolutional
layer is exported into a batch of GRU input vectors; then, the feed function is used
to get the output from the RNN. The actual RNN model needs to be carried out
according to these concrete steps. The output of the RNN is treated as the input of
the fully connected layer, where it is multiplied by the weight matrix to get the
final output. After defining the DNN model, we also need to specify the loss
function and the training algorithm to train the model. We use cross entropy as the
loss function of this model and the adaptive moment estimation (Adam) algorithm as
the optimization algorithm.
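
The data flow described above (recurrent unit, fully connected layer, cross-entropy
loss) can be traced with a deliberately tiny example. The sketch below uses a scalar
GRU cell in one common formulation; gate conventions differ between libraries, and
the convolutional front end and the Adam optimizer are omitted. All names and
parameter values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h, p):
    """One step of a scalar GRU cell; p maps parameter names to floats."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])                # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])                # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])   # candidate state
    return (1.0 - z) * h + z * h_cand


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross-entropy loss for one sample whose true class index is label."""
    return -math.log(probs[label])


def classify_sequence(xs, p, w_out, b_out):
    """Run the GRU over a sequence, then apply a fully connected layer and softmax."""
    h = 0.0
    for x in xs:
        h = gru_step(x, h, p)
    return softmax([w * h + b for w, b in zip(w_out, b_out)])


if __name__ == "__main__":
    names = ("wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh")
    params = {name: 0.5 for name in names}       # made-up parameter values
    features = [0.0, 0.1, 0.9, 0.8]              # e.g. per-frame change magnitudes
    probs = classify_sequence(features, params, w_out=[1.0, -1.0], b_out=[0.0, 0.0])
    print(probs, cross_entropy(probs, label=0))
```
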

We observed that video dynamics detection based on deep learning is more
resilient to external influence. Dynamic object segmentation, tracking, and event
recognition could also be implemented based on DNNs [18].


2.3 Surveillance Data Compression 


After surveillance data has been captured [1], we need to find space to store it.
The surveillance data is compressed because surveillance devices usually operate
all day long and all year round. Various standards have been developed for the
purpose of data compression [5], which are grouped into two categories:


• Still image compression. Each video frame is encoded independently as a static
image. The most well-known standards of image compression are JPEG and JPEG 2000.
JPEG stands for Joint Photographic Experts Group. The Motion JPEG (M-JPEG)
standard encodes each frame using the JPEG algorithm.

• Motion estimation. Video frames are organized into a group of pictures (GoP) in
which the first frame is encoded independently. For each of the other frames, only
the differences between the previous frame and itself are encoded. Typical
standards are MPEG-2, MPEG-4 (H.263, H.264, H.265), MPEG-7, MPEG-21, etc. MPEG
stands for Moving Picture Experts Group [10].
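
The idea of encoding only differences within a GoP can be illustrated with plain
frame differencing. The sketch below treats each frame as a flat list of pixel
values and stores the first frame in full and every later frame as a difference to
its predecessor. Real video codecs add motion-compensated prediction, transforms,
and entropy coding, none of which is modelled here; the function names are
hypothetical.

```python
def encode_gop(frames):
    """Keep the first frame as is; encode each later frame as a per-pixel difference."""
    if not frames:
        return []
    encoded = [list(frames[0])]
    for previous, current in zip(frames, frames[1:]):
        encoded.append([c - p for c, p in zip(current, previous)])
    return encoded


def decode_gop(encoded):
    """Rebuild the frames by accumulating the differences onto the first frame."""
    if not encoded:
        return []
    frames = [list(encoded[0])]
    for diff in encoded[1:]:
        frames.append([p + d for p, d in zip(frames[-1], diff)])
    return frames


if __name__ == "__main__":
    gop = [[10, 10, 10, 10], [10, 12, 10, 10], [10, 12, 11, 10]]
    print(encode_gop(gop))
```

In a static scene most differences are zero, which is what makes the later frames
cheap to compress.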

Generally, there are three methods for still image compression:

• Discrete Cosine Transform (DCT).

The DCT is used in lossy compression, and it has been successfully applied in
JPEG, MPEG, H.26x, etc. The DCT expresses a finite sequence of data points in
terms of cosine functions cos(n · θ) oscillating at different frequencies; n is
the frequency, as shown in Fig. 2.6. The DCT is important to a number of
applications in engineering, from lossy compression of audio signals (e.g., MP3)
to images (e.g., JPEG), where some high-frequency components are usually
discarded.

Figure 2.6 shows the change in frequency of one of the trigonometric functions.
Interestingly, the trigonometric functions y = sin(nx) and y = cos(nx),
x ∈ (−∞, +∞), n = 0, 1, 2, ..., construct an orthogonal function system for
digital signal processing.

In particular, the DCT is a Fourier-related transform which is very similar to the
discrete Fourier transform (DFT), but it uses only real numbers and offers good
energy compaction. The DCT is equivalent to a DFT of roughly twice the length
operating on real data with even symmetry.

Fig. 2.6 Frequency changes of the function y = cos(nx): (a) n = 1, (b) n = 5,
(c) n = 10, (d) n = 100

The discrete formula of the 1D-DCT is

$$
y(k) = w(k) \sum_{n=1}^{N} x(n) \cos\left[\frac{\pi (2n-1)(k-1)}{2N}\right],
\quad k = 1, 2, \ldots, N \tag{2.1}
$$

where N is the length of x, and x and y are of the same size. The series are indexed
from n = 1 and k = 1 instead of the usual n = 0 and k = 0 because the vectors run
from 1 to N instead of from 0 to N − 1. The weights are

$$
w(k) = \begin{cases}
\dfrac{1}{\sqrt{N}}, & k = 1; \\
\sqrt{\dfrac{2}{N}}, & 2 \le k \le N.
\end{cases} \tag{2.2}
$$

The 1D-IDCT reconstructs the signal from the coefficients y(k),

$$
x(n) = \sum_{k=1}^{N} w(k)\, y(k) \cos\left[\frac{\pi (2n-1)(k-1)}{2N}\right],
\quad n = 1, 2, \ldots, N \tag{2.3}
$$


• Fractal Compression.
Fractal compression is performed by locating self-similar sections of an image and
describing them by affine transformations. It is a lossy compression method that
reconstructs an acceptable approximation of the original image.


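$$
\begin{bmatrix} x' \\ y' \end{bmatrix}
= \begin{bmatrix}
\alpha \cos\theta & -\sin\theta \\
\sin\theta & \beta \cos\theta
\end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
+ \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix} \tag{2.4}
$$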

where α and β are scaling factors in the x and y directions; Δx and Δy are the
shifts along x and y, respectively; θ is the rotation angle in the anti-clockwise
direction.
An affine transformation of an image is a combination of rotation, scaling (or
zooming), and translation (shift). Fractal image compression first segments an
image into non-overlapping domain regions (they can be of any size or shape). Then,
a collection of candidate regions is defined. The candidate regions need not
overlap or cover the entire image, but they must be larger than the domain regions.
For each domain region, the algorithm searches for a candidate region that very
closely resembles the domain region after an appropriate affine transformation is
applied. Afterward, a fractal image format (FIF) file is generated for the image.
This file contains information on the choice of domain regions and the list of
affine coefficients (i.e., the entries of the transformation matrix) of all
associated affine transformations. Thus, all the pixels in a given region, each
with a color value given by an integer between 0 and 255, are compressed into a
small set of entries of the transformation matrix. This process is independent of
the resolution of the original image. The output graphics will look like the
original at any resolution, because the compressor has found an iterated function
system (IFS) whose attractor replicates the original image.

• Discrete Wavelet Transform (DWT).
The DWT is a hierarchical representation of an image where each layer represents a
frequency band. The less important layers are discarded for lossy compression.
Wavelets are functions which allow data analysis of visual signals or images
according to scales or resolutions.
The DWT represents an image as a sum of wavelet functions with different locations
and scales, decomposing the data into a set of high-pass (detail) and low-pass
(approximation) coefficients. In the one-dimensional discrete wavelet transform
(1D-DWT), the input data passes through a low-pass filter and a high-pass filter,
and the output of each filter is usually downsampled by 2. The output of the
low-pass filter gives the approximation coefficients, and the output of the
high-pass filter gives the detail coefficients. In the case of image compression,
which works in two directions, both rows and columns, the two-dimensional discrete
wavelet transform (2D-DWT) should be taken into account. The outputs are then
downsampled in each direction.


$$
x_l(n) = \sum_{k=0}^{K-1} x(Q \cdot n - k) \cdot g(k) \tag{2.5}
$$

$$
x_h(n) = \sum_{k=0}^{K-1} x(Q \cdot n - k) \cdot h(k) \tag{2.6}
$$

$$
x_l(m, n) = \sum_{k=0}^{K-1} x(m, Q \cdot n - k) \cdot g(k) \tag{2.7}
$$

$$
x_{l,l}(m, n) = \sum_{k=0}^{K-1} x_l(Q \cdot m - k, n) \cdot g(k) \tag{2.8}
$$

$$
x_h(m, n) = \sum_{k=0}^{K-1} x(m, Q \cdot n - k) \cdot h(k) \tag{2.9}
$$

$$
x_{h,l}(m, n) = \sum_{k=0}^{K-1} x_h(Q \cdot m - k, n) \cdot g(k) \tag{2.10}
$$

where g(k) represents the low-pass filter, h(k) the high-pass filter, and Q the
downsampling factor.
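
A one-level 1D-DWT with the shortest possible filters gives a feel for the low-pass
and high-pass outputs. The sketch below uses the Haar wavelet, which is an assumed
choice (the text does not fix the filters), with downsampling by 2. For simplicity
it pairs the samples as (x[2n], x[2n+1]) rather than following the exact index
convention of Eqs. (2.5) and (2.6).

```python
import math

_S = 1.0 / math.sqrt(2.0)  # Haar filter coefficient


def haar_dwt_1d(x):
    """One level of the Haar DWT: returns (approximation, detail) coefficients."""
    if len(x) % 2 != 0:
        raise ValueError("the input length must be even")
    half = len(x) // 2
    approximation = [_S * (x[2 * n] + x[2 * n + 1]) for n in range(half)]  # low-pass
    detail = [_S * (x[2 * n] - x[2 * n + 1]) for n in range(half)]         # high-pass
    return approximation, detail


def haar_idwt_1d(approximation, detail):
    """Invert one level of the Haar DWT."""
    x = []
    for a, d in zip(approximation, detail):
        x.extend([_S * (a + d), _S * (a - d)])
    return x


if __name__ == "__main__":
    signal = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 0.0, 2.0]
    approx, detail = haar_dwt_1d(signal)
    print(approx)
    print(detail)
```

Setting small detail coefficients to zero before inverting is the basic mechanism of
lossy wavelet compression. A 2D-DWT applies the same step along the rows and then
along the columns.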


2.4 Digital Image Encoding and Decoding 


Generally, the steps for encoding and decoding digital images in JPEG include color space conversion, subsampling, DCT and IDCT transforms, quantization, run-length encoding (RLE), etc.

The representation of colors in an image is first converted from RGB to YCbCr; the inverse conversion from YCbCr back to RGB, which is used in decoding, is shown in Eq. (2.11).


$$
\begin{aligned}
R &= Y + 1.402 \cdot (C_r - 128) \\
G &= Y - 0.34414 \cdot (C_b - 128) - 0.71414 \cdot (C_r - 128) \\
B &= Y + 1.772 \cdot (C_b - 128)
\end{aligned} \tag{2.11}
$$
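The conversion can be sketched in a few lines of Python. The function `ycbcr_to_rgb` uses the coefficients of Eq. (2.11); the forward function `rgb_to_ycbcr` uses the commonly quoted JFIF coefficients, which are not given in the text and are an assumption here. Both function names are hypothetical, and clipping to the range 0 to 255 is omitted.

```python
def rgb_to_ycbcr(r, g, b):
    """Forward conversion (JFIF-style coefficients, assumed for illustration)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    """Inverse conversion using the coefficients of Eq. (2.11)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.34414 * (cb - 128) - 0.71414 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return r, g, b

# A gray pixel carries no chroma: Cb and Cr sit at the mid value 128.
print(rgb_to_ycbcr(128, 128, 128))
# Converting forward and back recovers the pixel up to rounding error.
print(ycbcr_to_rgb(*rgb_to_ycbcr(255, 0, 0)))
```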


However, the RGB color space is not the best choice for compression. The YCbCr and YUV color spaces are more effective than RGB, and their components can be encoded with adjustable proportions.
As is well known, the RGB space is nonlinear in perception when it is used to present color images. Our human visual system (HVS), however, responds primarily to lighting stimulation, so luminance plays a pivotal role; thus, we wish to separate the luminance component (Y) from the color information [14].

The resolution of the chroma data is usually reduced by downsampling. Downsampling here refers to halving the size of an image block; consequently, it greatly reduces the file size. The image is then split into blocks of 8 × 8 pixels. Each of the Y, Cb, and Cr components undergoes a discrete cosine transform (DCT). The 2D-DCT is based on decomposition over an orthogonal system of cosine functions.


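$$B_{pq} = \alpha_p \alpha_q \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} A_{mn} \cos\frac{\pi (2m+1) p}{2M} \cos\frac{\pi (2n+1) q}{2N} \tag{2.12}$$

where $0 \le p \le M-1$, $0 \le q \le N-1$, and M and N are the row and column sizes,
respectively. If we apply the DCT to image data in real numbers, the result is also
real. The DCT tends to concentrate energy, so the important elements are found at the
upper left corner of the 8 × 8 blocks used for digital image compression.

$$\alpha_p = \begin{cases} \dfrac{1}{\sqrt{M}}, & p = 0 \\ \sqrt{\dfrac{2}{M}}, & 1 \le p \le M-1 \end{cases} \qquad \alpha_q = \begin{cases} \dfrac{1}{\sqrt{N}}, & q = 0 \\ \sqrt{\dfrac{2}{N}}, & 1 \le q \le N-1 \end{cases}$$

The 2D-IDCT is

$$A_{mn} = \sum_{p=0}^{M-1} \sum_{q=0}^{N-1} \alpha_p \alpha_q B_{pq} \cos\frac{\pi (2m+1) p}{2M} \cos\frac{\pi (2n+1) q}{2N} \tag{2.13}$$

where $0 \le m \le M-1$ and $0 \le n \le N-1$.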
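The transform pair of Eqs. (2.12) and (2.13) can be sketched with NumPy by writing the cosine basis as a matrix. The function names `dct_matrix`, `dct2`, and `idct2` are hypothetical; this is an illustrative sketch and not an optimized JPEG implementation.

```python
import numpy as np

def dct_matrix(M):
    """T[p, m] = alpha_p * cos(pi * (2m + 1) * p / (2M)), as in Eq. (2.12)."""
    p = np.arange(M).reshape(-1, 1)
    m = np.arange(M).reshape(1, -1)
    T = np.cos(np.pi * (2 * m + 1) * p / (2 * M))
    alpha = np.full((M, 1), np.sqrt(2.0 / M))
    alpha[0, 0] = np.sqrt(1.0 / M)
    return alpha * T

def dct2(A):
    """2D-DCT: B = T_M A T_N^T."""
    M, N = A.shape
    return dct_matrix(M) @ A @ dct_matrix(N).T

def idct2(B):
    """2D-IDCT: A = T_M^T B T_N (the basis matrices are orthogonal)."""
    M, N = B.shape
    return dct_matrix(M).T @ B @ dct_matrix(N)

# A flat 8 x 8 block has all of its energy in the upper left (DC) coefficient.
flat = np.full((8, 8), 100.0)
coeffs = dct2(flat)
print(round(coeffs[0, 0], 6))
```

Because the basis matrices are orthogonal, applying `idct2` to the output of `dct2` returns the original block up to floating-point error.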

Subsequently, the amplitudes of the frequency components are quantized, which rounds the values of the DCT coefficients; the high-frequency components are discarded so as to achieve image compression. The visual quality does not change much, because our human visual system (HVS) cannot perceive such minor changes.

The resulting data for all 8 × 8 blocks is further compressed with a lossless algorithm such as a variant of Huffman encoding [5], which further reduces the file size. Huffman coding is a method for constructing minimum-redundancy codes, and it is an entropy encoding technique used in lossless data compression. The output of the algorithm can be viewed as a variable-length code table for encoding source symbols (such as the characters in a file). Huffman's algorithm derives this table from the estimated probability or frequency of occurrence (weight) of each possible value of the source symbol. As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols. The problem which Huffman coding attempts to solve is described below.

Given an alphabet A = {a_1, a_2, ..., a_N} (N is the size of A) and a set
W = {w_1, w_2, ..., w_N} which represents the weights of all the elements in A, how
should every element of A be encoded in binary with regard to its weight in the whole
set? The routine for Huffman coding is shown in Algorithm 2.


Input : The coding symbols and their probabilities
Output: The coding tree

Initialization;
Create a leaf node for each symbol;
Add each leaf node to the priority queue;

while < The queue contains more than one node > do
    (1) Remove the two nodes of highest priority (lowest probability) from the queue;
    (2) Create a new internal node with these two nodes as children and with probability
        equal to the sum of the two nodes' probabilities;
    (3) Add the new node to the queue;
end

The remaining node is the root node and the tree is complete;


Algorithm 2: Huffman coding algorithm
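A minimal Python sketch of Algorithm 2 is given below, using the standard-library `heapq` module as the priority queue. The function name `huffman_codes`, the tie-breaking counter, and the convention of labeling the left branch "0" and the right branch "1" are assumptions made for illustration; symbols are assumed not to be tuples, since tuples are used here for internal nodes.

```python
import heapq
from itertools import count

def huffman_codes(weights):
    """Build a code table {symbol: bit string} from a {symbol: weight} mapping."""
    tie = count()  # tie-breaker so that the heap never compares tree nodes
    heap = [(w, next(tie), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single symbol still needs one bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # (1) remove the two nodes of lowest probability
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # (2) and (3): merge them into an internal node and put it back
        heapq.heappush(heap, (w1 + w2, next(tie), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: descend to both children
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                        # leaf node: record the code of the symbol
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes

table = huffman_codes({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
for symbol, code in sorted(table.items()):
    print(symbol, code)
```

In this example the most probable symbol receives the shortest code, and no code is a prefix of another, so the bit stream can be decoded unambiguously.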


In summary, the DCT concentrates the energy of an 8 × 8 block in its upper left corner.
This energy compaction helps us encode the image in a very compact way. The decoding process
reverses these steps.

In our real world, we have not only integer dimensions, such as 1, 2, and 3
dimensions, but also fractal (non-integer) dimensions, which can take real values such as the constants e = 2.718... and π = 3.1415926.... Fractal theory has been applied to digital image compression. The
features of fractal theory include (Fig. 2.7):


• Recursion: A function calls itself in a nested manner, subject to a terminating condition, e.g.,
    function F(float a) {
        if (a > 1) {
            a := a / 2;
            F(a);
        }
    }


• Self-similarity: The fractal image is quite similar to itself in details but with different sizes (zooming) or rotation angles; an example is shown in Fig. 2.7.

Fig. 2.7 Similarity of fractal curves


• Iterated function system (IFS): Fixed-point iteration in mathematics typically is

$$x_{n+1} = \alpha \cdot f(x_n) + \beta, \quad n = 1, 2, \ldots \tag{2.14}$$

where $x^{*} = \lim_{n \to \infty} x_n$ is the fixed point, $\alpha$ and $\beta$ are constants ($|\alpha| < 1$), and $f$ is the iterated function.

• Photocopy machine algorithm: The algorithm is used for the purpose of duplication based on the mechanism of iteration and the IFS. An example is shown
in Fig. 2.8 for this photocopy machine algorithm using affine transformations.
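The fixed-point iteration of Eq. (2.14) can be sketched in Python as follows. The choice f = cos, α = 0.5, and β = 1.0 is an assumption made only for illustration (it satisfies |α| < 1), and the function name `fixed_point` is hypothetical. Like the photocopy machine algorithm, the iteration reaches the same result regardless of the starting value.

```python
import math

def fixed_point(f, alpha, beta, x0, tol=1e-12, max_iter=1000):
    """Iterate x_{n+1} = alpha * f(x_n) + beta until successive values agree."""
    x = x0
    for _ in range(max_iter):
        x_next = alpha * f(x) + beta
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("iteration did not converge")

# Two very different starting points converge to the same fixed point x*.
x_a = fixed_point(math.cos, 0.5, 1.0, x0=0.0)
x_b = fixed_point(math.cos, 0.5, 1.0, x0=10.0)
print(x_a, x_b)
```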


Fractal theory has been employed for image compression. Similar blocks related by
affine transformations are selected as a codebook; afterward,
the affine coefficients are stored; eventually, digital images are encoded using this
codebook together with the coefficients of the affine transformations.

The wavelet transform, on which JPEG 2000 is based, is a kind of comb filtering. Visual signals are usually decomposed into various frequency bands by using orthogonal wavelet functions. The important information is contained in the low-frequency portion, and less important information is contained in the high-frequency part. After a finite number of comb-like filtering passes, an image is progressively decomposed. The layered visual signals are then employed to reconstruct the original image according to the requirements.

Usually, an image is displayed progressively: further details are added to the basic low-frequency information as they become available. For image compression, images are segmented and stored in a hierarchical structure; as details are added progressively during reconstruction, the final image is reconstructed while unnecessary high-frequency details are discarded. The degree of compression is subject to the image quality acceptable to the human visual system. An example of wavelet-based image decomposition is shown in Fig. 2.9, and the reconstructed image is shown in Fig. 2.10.

Fig. 2.9 Decomposing an image using Haar wavelet at level three




Fig. 2.10 Reconstructed image using wavelet components partially 


2.5 Digital Video Encoding and Decoding 


On the well-known YouTube website, most of the videos are compressed using the MPEG-4 standard, in formats such as MP4 and FLV. MPEG stands for Moving Picture Experts Group;
the brief history of the MPEG video compression standards is listed below [24].


• MPEG-1 (1992): This video compression scheme was designed for VCD players, and the data stream is up to 1.5 Mbit/s. Audio Layer III of MPEG-1 has been developed as the well-known MP3 format for audio compression.

• MPEG-2 (1994): This video compression standard was for digital broadcast television, and the bit rates are between 1.5 and 15 Mbit/s.

• MPEG-4 (1999, 23 Parts): This widely adopted standard is for adaptive video coding, including advanced video coding (AVC), and was designed for object-oriented composite files; the block size varies from 2 × 2 and 4 × 4 to 8 × 8, or even larger.

• MPEG-7 (2001, 10 Parts): This standard is for event-based multimedia content description, which uses XML to store metadata in order to tag particular events. The standard includes information on content manipulation, filtering, and personalization.

• MPEG-21 (2002, 9 Parts): This standard is for sharing digital rights, permissions, and restrictions of digital content. It is used for the purpose of copyright protection of digital media [2].


MPEG-1 was offered for VCD players in the recreation and entertainment industry; however, its resolution is very low. MPEG-2 was designed for DVD players. MPEG-4 is a very popular video standard that is widely used in today's videophones, the Internet, and mobile multimedia communications such as H.26x, and even in today's high-definition (HD) and 4K TV. MPEG-7 is designed for content-based event encoding that will be used for semantic description, especially for further content-based search, retrieval, mining, and reasoning; the further standard MPEG-21 is designed for multimedia copyright protection [2].
In MPEG, we have two very important concepts: motion vectors (block shifting in 2D) and group of pictures (GOP); the related terms are explained as follows:


• I-Frame refers to an intra-coded picture/frame.

• B-Frame refers to a bidirectionally predicted picture/frame.

• P-Frame refers to an inter-coded picture/forward-predicted frame.

• Motion vector refers to the corresponding position of a macroblock in another picture.


As shown in Fig. 2.11, we start from I-frames; P-frames are then predicted from preceding I- or P-frames using motion vectors, and B-frames are interpolated between the surrounding reference frames within a group of pictures (GOP). Only after all decompressed frames have been sorted into the right order in the display buffer can an MPEG video be played. In the MPEG family, videos rely on these I-, B-, and P-frames in order to be played correctly.
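How a motion vector might be found can be sketched with exhaustive block matching, as shown below. The sum of absolute differences (SAD) criterion, the 8 × 8 block size, the ±4 pixel search window, and the function name `find_motion_vector` are assumptions made for illustration; real encoders use faster search strategies.

```python
import numpy as np

def find_motion_vector(ref, cur, top, left, block=8, search=4):
    """Return ((dy, dx), sad): the offset into `ref` that best matches the
    block of `cur` whose upper left corner is at (top, left)."""
    target = cur[top:top + block, left:left + block].astype(np.int64)
    best, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # skip candidate blocks that fall outside the reference frame
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue
            cand = ref[y:y + block, x:x + block].astype(np.int64)
            sad = int(np.abs(cand - target).sum())
            if best_sad is None or sad < best_sad:
                best, best_sad = (dy, dx), sad
    return best, best_sad

# Synthetic test: the current frame is the reference shifted down 2, left 3.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(32, 32))
cur = np.roll(ref, shift=(2, -3), axis=(0, 1))
print(find_motion_vector(ref, cur, top=12, left=12))
```

With the vector known, an encoder only needs to store the vector and the (ideally small) residual between the two blocks instead of the full block.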

The H.26x standards were designed for multimedia communications
such as videophone conversation; now, this family of compression standards has been
developed further for use on the Internet, mobile phones, and HD TV displays.


Fig. 2.11 Relationships between I-, B-, P-frames of MPEG videos

• H.261 (1990) was designed for two-way communication over ISDN lines and supports data rates in multiples of 64 kbit/s. The scheme is based on the DCT and uses intraframe (I-Frame coding) and interframe (P-Frame coding) compression; it was utilized primarily in older videoconferencing and video telephony products. H.261 was the first practical digital video compression standard. Essentially, all subsequent standard video codec designs are based on it. H.261 uses the YCbCr color representation, the 4:2:0 sampling format, and 8-bit sample precision, and it supports progressive scan video.

• H.262 (1994) was developed by the ITU-T Video Coding Experts Group (VCEG)
and is maintained jointly with the ISO/IEC Moving Picture Experts Group
(MPEG); it is identical in content to the video part of the ISO/IEC MPEG-2
standard.

• H.263 (1996) is used primarily for videoconferencing, video telephony, and Internet video. It represents a significant step forward in standardized compression capability for progressive scan video, especially at low bit rates. H.263 was developed as an evolutionary improvement based on the experience obtained from H.261, the previous ITU-T standard for video compression, as well as the MPEG-1 and MPEG-2 standards. Originally designed as a low-bit-rate compressed format for videoconferencing, H.263 is a required video codec for the IP multimedia subsystem (IMS), multimedia messaging service (MMS), and transparent end-to-end packet-switched streaming service (PSS).

• H.264 (2003)

The ITU-T H.264 standard and the ISO/IEC MPEG-4 Part 10, or advanced video
coding (AVC), standard are jointly maintained so that they have identical technical
content [19, 20].

H.264, also known as MPEG-4 AVC, was developed for use in high-definition systems such as HDTV and HD DVD as well as low-resolution portable
devices. H.264 offers better quality at lower file sizes than both MPEG-2 and
MPEG-4 ASP (DivX or XviD).

• H.265 (2013)

H.265, high efficiency video coding (HEVC), scales from 320 × 240 pixels to
7680 × 4320 pixels and is used for high-definition (HD) and 4K display technology.
H.265 requires substantially more computational power to decode than H.264.
Repeated quality comparison tests have demonstrated that H.265 reduces file size
by roughly 39-44% at the same quality compared to H.264.


Our media players need a software codec in order to play a video. A codec is
software, often embedded in the operating system, that encodes and decodes video; our players usually rely on
these codecs for video playback. The commonly used codecs are listed below:


• Dirac (BBC, 2004) is a prototype algorithm for encoding and decoding raw videos
and for video transmission over the Internet. The aim is to decode standard digital
PAL TV definition in real time.

• Indeo (Intel, 1992) was distinguished by being one of the first codecs to allow full-speed video playback without hardware acceleration. Indeo 5 decoders exist for Microsoft Windows, Mac OS Classic, Unix, etc.

• Motion JPEG (M-JPEG): Each video frame or interlaced field of a digital video
sequence is separately compressed as a JPEG image (DCT based), so the relationship
between two adjacent video frames is not exploited. M-JPEG is used by IP-based video cameras via HTTP streams.

• RealVideo (RealNetworks, 1997) is a streaming media format designed for network delivery.

• WMV (Microsoft) was designed for Internet streaming applications as a competitor to RealVideo.

• DivX (Digital Video Express) uses lossy MPEG-4 Part 2 compression, i.e., MPEG-4
ASP.

Fig. 2.12 A frame of a PETS video for compression test

An exemplar video from the well-known dataset of performance evaluation of
tracking and surveillance (PETS) is shown in Fig. 2.12; the results of comparing
video compression using different codecs are shown in Table 2.2. We applied various compression approaches to the same video and recorded the resulting file sizes.

Table 2.2 An exemplar video frame from the PETS dataset

Video format    File size (MB)


2.6 Questions 


Question 1. What is the finite state machine (FSM)? How to perform FSM simpli- 
fication? What’s the relationship between FSM and HMM? 

Question 2. Please explain the below concepts in FSM. 

(1) Action 

(2) Activity 

(3) Event 

(4) Object 

(5) State 

Question 3. Draw state diagram of an automated sound recording system using FSM 
and explain it. 

Question 4. Draw the state transition table of an automated sound recording system
using FSM. What are the relationships between FSM and HMM?

Question 5. Draw flowchart of an automated sound recording system using FSM. 
Question 6. Write pseudocodes (input, output, and procedures) of an automated 
sound recording system using FSM. 

Question 7. Please explain the steps of JPEG image compression. 

Question 8. Please explain the below concepts in video compression. 

(1) MPEG-1 

(2) MPEG-2 

(3) MPEG-21 

(4) MPEG-4 

(5) MPEG-7 

Question 9. What are the differences between H.265 and H.264? 

Question 10. Please explain the below concepts in a MPEG video. 

(1) I-Frame 

(2) B-Frame 

(3) P-Frame 

(4) Motion vector 

(5) GoP 

(6) Quantization

Question 11. Please explain the concept of fractal image compression.
Question 12. How to determine a threshold based on our observations?


Surveillance Data Secure Transmissions


3.1 Introduction to Network Communications 
3.1.1 Computer Network 


As is well known, the seven layers of a computer network provide a variety of services,
and the protocols [1] need various facilities to support the network so as to carry out the
designed functionalities [12, 13].


• Layer 1: Physical Layer


In a computer network, the physical layer comprises the network hardware for data transmission. It is a fundamental layer underlying the logical data structures of the high-level
functions. Due to the diversity of available network technologies with broadly varying characteristics, this is perhaps the most complex layer in the network architecture.

The physical layer defines means of transmitting bits rather than logical data 
packets over physical network. The bit stream may be presented by using coded 
words or symbols that are transmitted over a transmission medium. The physical 
layer provides a procedural interface to the transmission medium. The frequencies 
to broadcast on, the modulation scheme to use, and similar low-level parameters are 
specified in this layer. 

Within the semantics of the network architecture, the physical layer translates logical
communication requests from the data link layer into hardware-specific operations so
as to effect transmission or reception of signals.


• Layer 2: Data Link Layer


The data link layer transfers data between adjacent network nodes in a wide area network
(WAN) or between nodes on the same local network. The data link layer provides the
functional and procedural means to transfer data between network entities and might
provide the means to detect and possibly correct errors that may occur in the physical
layer. Examples of data link protocols are Ethernet for local networks (multinode) and the
Point-to-Point Protocol (PPP) for point-to-point connections [1, 13].

The data link layer is concerned with the local delivery of frames between devices
on the same LAN [13]. Data link frames, as these protocol data units are called,
do not cross the boundaries of a local network. The data link layer is analogous to
a neighborhood traffic officer: it endeavors to arbitrate between parties contending for
access to a medium without concern for their ultimate destination.

When devices attempt to use a medium simultaneously, frame collisions occur. Data link protocols specify how devices detect and recover from such collisions
and may provide mechanisms to reduce or prevent them. This layer is implemented
in network bridges and switches.


• Layer 3: Network Layer


The network layer provides the functional and procedural means of transferring variable-length data sequences from a source to a destination host through one or more
networks while maintaining the quality of service functions.

The functions of the network layer include:
(1) Connection model: This layer works in a connectionless way.
For example, a datagram travels from a sender to a recipient, and the recipient does not have to
send any acknowledgement. Connection-oriented protocols exist at other, higher
layers of the network model.
(2) Host addressing: Every host in the network must have a unique address that 
determines where it is. This address is normally assigned from a hierarchical system. 
On the Internet, the addresses are known as Internet Protocol (IP) addresses. 
(3) Message forwarding: Since computer networks are partitioned into subnetworks 
and connected to others for wide area communications, networks use specialized 
hosts, called gateways or routers, to forward packets between networks. This is also 
of interest to mobile applications, where a user may move from one location to 
another. Version 4 of the Internet Protocol (IPv4) was not designed with this feature 
in mind though mobility extensions exist. IPv6 has a better designed solution. 
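Hierarchical host addressing and the decision to forward through a router can be illustrated with Python's standard `ipaddress` module. The addresses below are arbitrary private-range examples, and the helper `needs_router` is a hypothetical name introduced only for this sketch.

```python
import ipaddress

network = ipaddress.ip_network("192.168.1.0/24")   # a local subnet
host = ipaddress.ip_address("192.168.1.42")        # a host on that subnet
neighbor = ipaddress.ip_address("192.168.1.1")     # another host, same subnet
remote = ipaddress.ip_address("10.0.0.7")          # a host on another network

def needs_router(src, dst, net):
    """A packet is handed to a gateway when the destination lies outside
    the sender's own subnet."""
    return (src in net) and (dst not in net)

print(host in network)                        # membership follows the prefix
print(needs_router(host, neighbor, network))  # delivered locally
print(needs_router(host, remote, network))    # forwarded via a router
```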

Within the service layering semantics of the Open Systems Interconnection (OSI) network architecture, the network layer responds to service requests from the
transport layer and issues service requests to the data link layer.


• Layer 4: Transport Layer


In computer networks, the transport layer provides end-to-end or host-to-host communication services within a layered architecture of network components and protocols [1]. The transport layer provides services such as connection-oriented data
stream support, reliability, flow control [5], and multiplexing.

Transport layers are defined in both the TCP/IP model, which is the foundation of the Internet, and the Open Systems Interconnection (OSI) model of general
networking [20].

The best-known transport protocol is the transmission control protocol (TCP) [13]. 
It is used for connection-oriented transmissions, whereas the connectionless user 
datagram protocol (UDP) is used for simpler messaging transmissions. The TCP is 
more complex due to its design incorporating reliable transmission and data stream 
services. Other prominent protocols in this group are the datagram congestion control 
protocol (DCCP) and the stream control transmission protocol (SCTP). 

Transport layer services are conveyed to an application via a programming inter- 
face to the transport layer protocols. The services may include the following features: 


• Connection-oriented communication. It is normally easier for an application to
interpret a connection as a data stream rather than to deal with the underlying
connectionless models, such as the datagram model of the user datagram protocol
(UDP) and that of the Internet Protocol (IP).

• Same order delivery. The network layer does not guarantee that packets of data
will arrive in the same order in which they were sent, but often this is a desirable feature.
This is usually done through segment numbering, with the receiver
passing the segments to the application in order. This may cause head-of-line blocking.

• Reliability. Packets may be lost during transport due to network congestion and
errors. By means of an error detection code, such as a checksum, the transport protocol may check that the data is not corrupted and verify correct receipt by sending
an ACK or NACK message to the sender. Automatic repeat request schemes may
be used to retransmit lost or corrupted data.

• Flow control. The rate of data transmission between two nodes must be managed
to prevent a fast sender from transmitting more data than the receiving data buffer can hold, which would cause a buffer overrun.

• Congestion avoidance. Congestion control [5] regulates traffic entry into a telecommunication network so as to avoid congestive collapse by preventing oversubscription of any processing or link capacity, for example by reducing the rate of sending packets.

• Multiplexing. Ports provide multiple end points on a single node. Computer
applications each listen on their own ports, which enables the use of more
than one network service simultaneously. It is a part of the transport layer in
TCP/IP [20].
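The error detection step in the reliability feature above can be sketched with the 16-bit one's-complement checksum used by the Internet protocols. The function name `internet_checksum` is hypothetical, and padding odd-length data with a zero byte is the usual convention assumed here.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                        # pad odd-length data
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add one 16-bit word
    while total >> 16:                         # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

payload = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(payload)
print(hex(checksum))

# The receiver recomputes the checksum over payload plus checksum;
# a result of zero means no error was detected.
received = payload + checksum.to_bytes(2, "big")
print(internet_checksum(received) == 0)
```

A receiver that obtains a nonzero result would reply with a NACK or simply withhold the ACK, which triggers retransmission.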


• Layer 5: Session Layer


The session layer provides the mechanism for opening, closing, and managing a session
between end-user applications, i.e., a semi-permanent dialog. Communication sessions consist of requests and responses that occur between applications. Session layer
services are generally used in environments that make use of remote procedure
calls (RPCs).

Within the service layering semantics of the OSI network architecture, the session
layer responds to service requests from the presentation layer and issues service
requests to the transport layer.

The session layer of the OSI model is responsible for session checking and recov- 
ery which allows information of different streams originating from different sources 
to be properly combined or synchronized. 

An exemplar application is web conferencing, in which the streams of audio and 
video must be synchronous. Flow control ensures that the person displayed on screen 
is the current speaker. Another application is in live TV programs, where streams 
of audio and video need to be seamlessly merged and transitioned from one to the 
other so as to avoid silent airtime or excessive overlap. 


• Layer 6: Presentation Layer


The presentation layer is responsible for delivering and formatting information to
the application layer for further processing or display. It relieves the application
layer of concern regarding syntactical differences in data representation within the end-user
systems.

The presentation layer is the lowest layer at which application programmers should 
consider data structure and presentation, instead of simply sending data in the form 
of datagrams or packets between hosts. This layer deals with issues of string repre- 
sentation. The idea is that the application layer should be able to point at the data to 
be moved, and the presentation layer will deal with the rest. 

Serialization of complex data structures into flat byte strings by using mechanisms
such as XML is regarded as the key functionality of the presentation layer. Network
encryption is typically conducted at this layer, although it can also be performed at the
application, session, transport, or network layers, each of which has its own advantages.

Network decryption is also handled at this layer. For example, when a user is logged on
to a bank site, the presentation layer will decrypt the data as it is received. Another
example is the representation of structure, which is normally standardized at this level,
often by using XML. In addition to simple pieces of data, like strings, more complicated
things are standardized in this layer. Two examples are "objects" in object-oriented
programming (OOP) and the exact way that streaming video is transmitted.
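Serialization of a structured record into a flat byte string with XML can be sketched using Python's standard `xml.etree.ElementTree` module. The record fields (`camera`, `type`, `frame`) and the function names are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

def serialize_event(event: dict) -> bytes:
    """Flatten a dictionary into an XML byte string, one element per field."""
    root = ET.Element("event")
    for key, value in event.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="utf-8")

def deserialize_event(payload: bytes) -> dict:
    """Rebuild the dictionary; all values come back as strings."""
    root = ET.fromstring(payload)
    return {child.tag: child.text for child in root}

record = {"camera": "cam-01", "type": "intrusion", "frame": 1024}
payload = serialize_event(record)
print(payload)
print(deserialize_event(payload))
```

Note that XML carries text only, so the receiving side must agree on how to interpret each field (here the frame number returns as the string "1024"); agreeing on such representations is the kind of concern this layer addresses.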

In many broadly used applications and protocols, no distinction is made between
the presentation layer and the application layer. For example, the hypertext transfer protocol
(HTTP) and hypertext transfer protocol secure (HTTPS), which are generally regarded as
application layer protocols, have presentation layer aspects such as the ability to identify
character encoding for proper conversion, which is then carried out in the application
layer.

Within the service layering semantics of the OSI network architecture, the presentation layer responds to service requests from the application layer and sends service
requests to the session layer.
In the OSI model, the presentation layer ensures that the information sent out by the application layer of one system is readable by the application layer of another system. If
necessary, the presentation layer might be able to translate between multiple data
formats by using a common format.


• Layer 7: Application Layer


In the Internet, the application layer is an abstraction layer reserved for communication protocols and methods designed for process-to-process communications across
Internet Protocol (IP) networks. Application layer protocols use the underlying transport layer
protocols to establish process-to-process connections via ports.

In the OSI model, the definition of its application layer is narrower in scope. The 
OSI model defines application layer as the user interface which is responsible for 
displaying the information received. The OSI application layer is responsible for 
displaying data and images to users in a human recognizable way and interacting 
with the presentation layer below it. 

The OSI model separates the functionality above the transport layer into two additional layers, the session layer
and the presentation layer, specifying a strict modular separation of functionality at
these layers. It also provides protocol implementations for each layer.

The application layer is the seventh layer of the OSI model and the only one that 
directly interacts with end user. The application layer provides a variety of services 
including: 


• Simple mail transfer protocol (SMTP)
• File transfer
• Web surfing
• Web chat
• Email clients
• Network data sharing
• Virtual terminals
• Various file and data operations


The application layer provides full end-user access to a large number of shared
network services for efficient OSI model data flow. This layer has a wealth
of responsibilities including error handling and recovery, data flow over a network,
and full network flow. It is also used to develop network-based applications.

More than 15 protocols are used in the application layer, including file transfer
protocol (FTP), telnet, trivial file transfer protocol (TFTP), and simple network management
protocol (SNMP) [13].


3.1.2 Network Hardware


A hub is a connection device in a network. Hubs, which contain multiple ports, are usually used to connect segments
of a LAN. When a packet arrives at one port, it is copied
to the other ports as well so that all segments of the LAN can see the packet.

Hubs and switches serve as a central connection for all network equipment and
handle a data type known as frames. Frames carry data. When a frame is received, it
is amplified and then transmitted to the port of destination.

In a hub, a frame is passed along or broadcast to every one of its ports. It
does not matter that the frame is only destined for one port. The hub has no way
of distinguishing which port a frame should be sent to. Passing it along to every port
ensures that it will reach its intended destination. This places a lot of traffic on
the network and leads to poor network response time.

A network bridge is the hardware that connects two or more networks so that
the networks can communicate. Users within home networks or small
office networks generally prefer to have a bridge when they have different types of
networks, so that information or files can be shared among all of the computers on those
networks.

A router is a device that forwards data packets along networks. A router is
connected to at least two networks, commonly two LANs or WANs. Routers are
located at gateways, the places where two or more networks are connected.

Routers use headers and forwarding tables to determine the best path for forwarding
the packets, and they use protocols such as ICMP to communicate with each other and
configure the best route between any two hosts [1].

A network switch [13] is a small hardware device that joins multiple computers
together within one local area network (LAN). Ethernet switch devices were commonly used on home networks before routers became popular; broadband routers
integrate Ethernet switches directly into the unit as one of their many functions. High-performance network switches are widely used in corporate networks and data centers.

Physically, network switches look nearly identical to network hubs. Switches,
unlike hubs, are capable of inspecting data as messages are received via a method
called packet switching. A switch determines the source and destination device of
each packet and forwards data only to the specific device intended, which conserves network bandwidth and generally improves performance compared to hubs. A typical
connection between network hardware is shown in Fig. 3.1.
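The difference between a hub and a switch can be sketched with two small Python classes. The address-learning behavior of the switch (remembering on which port each source address was last seen) is a common mechanism assumed here for illustration and is not described in the text; the class names, port numbers, and address strings are hypothetical.

```python
class Hub:
    """Repeats every incoming frame on all other ports."""
    def __init__(self, num_ports):
        self.num_ports = num_ports

    def forward(self, in_port, src, dst):
        return [p for p in range(self.num_ports) if p != in_port]

class LearningSwitch:
    """Learns which port each source address lives on and forwards a frame
    only to the port of its destination, flooding when it is unknown."""
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.table = {}  # address -> port

    def forward(self, in_port, src, dst):
        self.table[src] = in_port              # learn the sender's port
        if dst in self.table:
            out = self.table[dst]
            return [] if out == in_port else [out]
        return [p for p in range(self.num_ports) if p != in_port]

hub, switch = Hub(4), LearningSwitch(4)
print(hub.forward(0, "A", "B"))     # the hub always floods
print(switch.forward(0, "A", "B"))  # B unknown yet: the switch floods once
print(switch.forward(1, "B", "A"))  # A was learned: only port 0 is used
print(switch.forward(0, "A", "B"))  # B was learned: only port 1 is used
```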


3.1.3 Network Threats and Attacks 


A threat is a potential security violation [5]. The potential threats to a network usually
come from software, hardware, and virtualized systems [10].

In computer networks [5], an attack is any attempt to destroy, expose, alter, disable, steal, or gain unauthorized access to or make unauthorized use of an asset.
A network attack is usually defined as an intrusion on the network infrastructure that
first analyzes the environment and collects information in order to exploit
existing open ports or vulnerabilities; this may also include unauthorized access
to resources [13]. Attacks whose purpose is only to learn and obtain
information from the system, without altering or disabling system resources, are passive attacks.
Active attacks occur where the perpetrator accesses and either alters, disables, or
destroys resources or data. Attacks are performed either from outside the organization by an unauthorized entity (outside attack) or from within the company by an
"insider" who already has certain access to the network (inside attack). Very often
the network attack itself is combined with the introduction of a malware component
to attack the targeted systems.

Fig. 3.1 Typical connection of a computer network

A passive attack monitors unencrypted traffic and looks for clear-text passwords
and sensitive information that can be used in other types of attacks. Passive attacks
comprise traffic analysis, monitoring of unprotected communications, decrypting
weakly encrypted traffic, and capturing authentication information. Passive
interception of network operations enables adversaries to see upcoming actions.
Passive attacks result in the disclosure of information or data files to an attacker
without the consent or knowledge of the user.

In an active attack, the attacker tries to bypass or break into secured systems. This
can be carried out through viruses, worms, or Trojan horses. Active attacks include
attempts to circumvent or break protection features, introduce malicious code, and
steal or modify information. These attacks are mounted against a network backbone,
exploit information in transit, electronically penetrate an entire enclave, or attack an
authorized remote user during an attempt to connect to an enclave. Active attacks
result in the disclosure or dissemination of data files, DoS, or modification of data.

Denial of Service (DoS) aims at disrupting legitimate use and has two categories:
hogging the resources and disrupting the network. Usually, DoS will slow the network
performance (opening files or accessing websites), make a particular website
unavailable, prevent access to any website, dramatically increase the number of
spam emails received, disconnect a wireless or wired Internet connection,
etc. [10].

A distributed attack requires that the adversary introduce code, such as a Trojan
horse or back-door program, into a “trusted” component or software that will later be
distributed to many other companies and users. Distribution attacks focus on
the malicious modification of hardware or software at the factory or during
distribution [13]. These attacks introduce malicious code, such as a backdoor, into a
product to gain unauthorized access at a specific later time.


Insider attack: In an insider attack, someone inside the organization, such as a
disgruntled employee, intentionally eavesdrops on, steals, or damages information,
uses information in a fraudulent manner, or denies access to other authorized users.
Non-malicious attacks typically result from carelessness, lack of knowledge, or
intentional circumvention of security for such reasons as performing a task.


Close-in attack: A close-in attack refers to getting physically close to network
components, data, and systems in order to learn more about a network. Close-in attacks
consist of regular individuals attaining close physical proximity to networks, systems,
or facilities for the purpose of modifying, gathering, or denying access to
information. Close physical proximity is achieved through surreptitious entry into the
network, open access, or both.

One popular form of close-in attack is social engineering: the attacker
compromises the network or system through social interaction with a person, for
example by using an email message or phone call. Various tricks may be used to get an
individual to reveal secure information. The leaked information would most likely be
used in a subsequent attack so as to gain unauthorized access to a system or network.


Phishing attack: In a phishing attack, a hacker creates a fake website that looks exactly
like a popular site such as PayPal. The phishing part of this attack is that the hacker
then sends an email to trick users into clicking a link that leads to the fraudulent site.
When a user logs on with the account information, the hacker records the username and
password, then tries that information on the real site.


Hijack attack: In a hijack attack, a hacker takes over a session between a user and
another individual, and disconnects the other individual from the communication
channel. The user still believes that (s)he is talking to the original party and
may send private information to the hacker.


Spoof attack: In a spoof attack, the hacker modifies the source address of the packets
(s)he sends so that they appear to be coming from someone else. This may be done to
bypass the firewall rules [13].


Buffer overflow: A buffer overflow attack happens when the attacker sends more
data to an application than what is expected. A buffer overflow attack usually
results in the attacker gaining administrative access to the system in a command
prompt or shell.


Exploit attack: In this type of attacks, the attacker knows a security problem within 
an operating system or a piece of software and leverages the knowledge by exploiting 
the vulnerability [13]. 


Password attack: An attacker tries to crack the passwords stored in a database or
a password-protected file [3]. There are three major types of password attacks: a
dictionary attack, a brute-force attack, and a hybrid attack. A dictionary attack uses a
word list file, which is a list of potential passwords. A brute-force attack happens
when the attacker tries every possible combination of characters [15].

Our passwords [1] are usually associated with digital devices which have a close
relationship with our work and life. Usually we have three types of passwords:


• Normal Password. Normally this type of password is used in computer
systems, Internet accounts, etc. A classical example is the authentication interface
of the Gmail (the Google email) system, see Fig. 3.2.


Fig. 3.2 Authentication interface of the Gmail system


Fig. 3.3 An example of dynamic password based authentication


Fig. 3.4 A graphical pattern ‘P’ is used as the password of a smart phone


• Dynamic Password. Also known as a one-time password, it is updated periodically, and
synchronization between a token and the servers is required. It provides another layer
of protection on top of normal password systems. Such passwords are used with
bank accounts, ATMs, etc. See an example in Fig. 3.3.

• Visual Password. When visual information is used for authentication and authorization,
the cipher space is much bigger than the traditional cipher text space. Nowadays,
most smart phones use a graphical pattern as the password to log into
the mobile operating system, as shown in Fig. 3.4; several systems have even adopted
the user’s fingerprint as the password.


In order to defend against password guessing, we suggest the following measures:


• Change the default password; this helps to thwart exhaustive search.

• Set a password format that mixes upper- and lower-case letters with numerical
and other nonalphabetical symbols.

• Avoid obvious passwords: break the semantics, and do not use a person’s name
or birthday in important passwords.
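
The three rules above can be sketched as a small checker. This is only an illustration under stated assumptions: the function name `check_password`, the tiny `COMMON_PASSWORDS` set, and the minimum length of 8 are hypothetical choices, not values taken from the text.

```python
import string

# Hypothetical list of default/obvious passwords; a real list would be far larger.
COMMON_PASSWORDS = {"password", "123456", "qwerty", "admin", "letmein"}


def check_password(password, personal_terms=(), min_length=8):
    """Return a list of rule violations; an empty list means every check passed."""
    problems = []
    lowered = password.lower()
    if len(password) < min_length:  # assumed minimum length
        problems.append("too short")
    if not any(ch.islower() for ch in password):
        problems.append("no lower-case letter")
    if not any(ch.isupper() for ch in password):
        problems.append("no upper-case letter")
    if not any(ch.isdigit() for ch in password):
        problems.append("no digit")
    if not any(ch in string.punctuation for ch in password):
        problems.append("no non-alphabetical symbol")
    if lowered in COMMON_PASSWORDS:
        problems.append("default or obvious password")
    # Names and birthdays are supplied by the caller, e.g. ("alice", "1990").
    if any(term and term.lower() in lowered for term in personal_terms):
        problems.append("contains a name or birthday")
    return problems
```

For instance, `check_password("admin")` reports several violations, including the default-password rule, while a mixed-format password without personal terms returns an empty list.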


Password attacks can be mounted in multiple ways, including exhaustive brute-force
attacks, dictionary-based attacks, rule-based attacks, Markov model-based attacks,
and password guessing based on recurrent neural networks (RNNs).

PassGAN [9] replaces human-generated password rules with theory-grounded
machine learning algorithms. It uses a Generative Adversarial Network (GAN) to
autonomously learn the distribution of real passwords from actual password leaks,
and to generate high-quality password guesses. PassGAN can autonomously extract
a considerable number of password properties that current state-of-the-art rules do
not encode.

The idea of PassGAN [9] is to train a neural network so as to autonomously 
determine password characteristics and structures, and leverage this knowledge to 
generate new samples that follow the same distribution. The output of PassGAN 
(fake samples) becomes closer to the distribution of passwords in the original leak, 
and hence more likely to match real users’ passwords. 

In general, with regard to network security, we usually refer to:


• CIA: confidentiality, integrity, and availability in information security.


CIA is a model designed to guide policies for information security
within an organization. In this context, confidentiality is a set of rules that limit
access to confidential information, integrity is the assurance that the information
is trustworthy and accurate, and availability is a guarantee of ready access
to the information by authorized staff.

Confidentiality prevents sensitive information from reaching the wrong persons,
while making sure that the right receivers in fact get it. A good example is an account
number or routing number when banking online. Data encryption is a common
method of ensuring confidentiality. User IDs and passwords constitute a standard
procedure. In addition, users take precautions to minimize the number of places
where the information appears and the number of times it is actually transmitted to
complete a required transaction.

Integrity involves maintaining the consistency, accuracy, and trustworthiness of
data over its entire life cycle. The data must not be changed in transit, and steps
must be taken to ensure that the data cannot be altered by unauthorized staff members
(for example, in a breach of confidentiality). If an unexpected change occurs, a backup
copy must be available to restore the affected data to its correct state.

Availability is best ensured by rigorously maintaining all hardware, performing
hardware repairs immediately when needed, providing a certain measure of
redundancy and failover, offering adequate communications bandwidth and preventing the
occurrence of bottlenecks, implementing emergency backup power systems,
keeping current with all necessary system upgrades, and guarding against malicious
actions.


• AAA: authentication, authorization, and accounting in computer security.


Authentication, authorization, and accounting (AAA) is a framework for
intelligently controlling access to computer resources, enforcing policies, auditing
usage [1,5], and providing the information necessary to bill for services. These
combined processes are considered important for effective network management
and security.

As the first process, authentication provides a way of identifying a user, typically
by having the user enter a valid username and password before access is
granted [5]. The process of authentication is based on each user having a unique set
of criteria for gaining access. The AAA framework compares a user’s authentication
credentials with the credentials stored in a database [3]. If the credentials match, the
user is granted access. If the credentials are at variance, authentication fails and
access is denied.

Following authentication, a user must gain authorization for committing a task.
After logging into a system, for instance, the user may be issued permissions. The
authorization process determines whether the user has the authority to deal with
a task. Simply put, authorization is the process of enforcing policies: determining what
types or qualities of activities, resources, or services the user is permitted [5]. Usually,
authorization occurs within the context of authentication. Once a user is
authenticated, (s)he may be authorized for different types of access or activity.

The final plank in the AAA framework is accounting, which measures the 
resources a user consumes during access. This includes the amount of system time 
or data that a user has sent and/or received during a session. Accounting is carried 
out by using the logs of session conversations and usage information which is used 
for authorization, billing, trend analysis, resource utilization, and capacity planning 
activities [1]. 


• MARS: monitoring, analysis, and response in network security.


The security monitoring, analysis, and response system (MARS) provides security
monitoring for network devices and host applications from both Cisco and
other vendors. Security monitoring [1] with MARS greatly reduces false positives
by providing an end-to-end topological view of the network, which helps improve
threat identification, mitigation responses, and compliance [13].


• PSF: protection, surveillance, and forensics in data security.


Protection refers to the cryptographic level, which is related to network security and
data security. Surveillance concentrates on behavior and scene monitoring [13].
Network forensics is a subbranch of digital forensics relating to the monitoring and
analysis of computer network traffic for the purposes of information gathering, legal
evidence, or intrusion detection after incidents. Unlike other areas of digital forensics,
network investigations deal with volatile and dynamic information. Network traffic
is transmitted and then lost, so network forensics is often a proactive investigation.


3.1.3.1 Strategies for Network Security

From the management viewpoint, we hope that a network is well configured and has been
set up properly. The network is expected to have the capability to protect
itself and respond to any potential attacks. Ideally, we protect our
computer networks using the strategies below:


• Guard the outside perimeter: We utilize firewalls and other hardware devices to
separate a computer network from others [13]. This separation keeps problems such
as computer viruses from spreading to our computers; meanwhile,
outside hazards cannot penetrate the protection, which keeps the computers
inside the premises working well.

• Control internal access: For most computer networks, the problems or
attacks usually come from the inside; therefore, internal access should be controlled
for all operations. The logs for monitoring networks should record any events that
happen in the computer network; the necessary event analysis and prediction
are needed to reduce the threats from inside the network [2, 13]. Authorization
and authentication are crucial at the initial stage of protecting networks; accounting
and auditing help us to trace back for computer security [5]. Reporting and logging
on a network help us to record the events that happen in the network [5].

• Police the traffic: Policing could reduce network traffic and simultaneously,
publicly and openly, guide authorized staff and users as to what actions are right and
should be allowed inside a computer network. Strict policies reduce the hazards of
crowding and make reasonable use of the hardware and software resources, and
consequently make sure that all the networking parts work in a highly efficient way.

• Secure the devices: The server rooms and power rooms should be absolutely
secured. Irrelevant staff should keep clear of these rooms with surveillance
data and refrain from touching the devices inside important places. For the servers and
power devices, the policies and regulations should be regularly explained and
followed [5]. The input and output devices of computers, including keyboards,
mice, monitors, printers, etc., should not be touched and should be locked
when not in use; any operations should be observed by more than two onsite staff
members so as to reduce security mistakes [1, 13].

• Use self-defending networks: Computer networks should have the ability to respond
and defend themselves so as to reduce any repeated attacks and possible hazards,
avoiding cascade failures [5]. Once any part of the network is out of order while
running, the network should separate out that portion and make sure the remaining
parts keep working well [1].

• Policies, procedures, standards, and guidelines: These regulations and guidelines
always remind the staff to operate under the standard instructions and to keep
alert to any risks. The regulations also apply to new employees, and they keep
unauthorized persons at a distance from the perimeter.

• Training and awareness campaigns: The staff members who have the opportunity
to touch the computer networks should be trained regularly and update their
knowledge periodically. The campaigns help the staff to exchange experience
and learn how to respond to any incidents.


3.1.3.2 Network Event Monitoring and Analysis 

An event is an occurrence within a computer system that interacts with other
systems or users [2]. Computer and network systems contain event logs that encompass
an enormous amount of data. These event logs hold records of any behaviors or actions a
network device performs. Events may also involve illegal activities such as malicious
attacks. The plan is to assemble these events and examine their relationships by
researching and recording each activity into a knowledge database. This information
will help to avoid further incidents or risks after the events have occurred.

The goal of network event security analytics is to provide a mechanism for
gathering network data so that it may be analyzed for different purposes to improve
network security and performance by viewing packets as they traverse the network [4].
The ability to determine what kind of traffic enters a network and what effects it has
on the network can help administrators to come up with viable solutions to the problems
they encounter.

Network analysis tools are helpful for testing the strength and vulnerabilities of a
network because a network cannot be deemed secure without testing. These tools
depend heavily on the OSI model, which allows for communications between
different network layers so that relevant information can be gathered. Investigations into
this problem help to identify methods of extracting information from the
systems by implementing a listening technique or an activity tracker within a
single computer or multiple computing systems. Such a tracker gives detailed feedback
about what occurs within a system, and this information should be displayed in a
presentable user interface that does not overwhelm the user with useless information.

Event monitoring software commonly refers to packet sniffers or logging
applications because of their ability to scour network traffic and gain data about what is
being transferred over a network. Analyzing the traffic sent between two networked
machines is a common job for a network administrator who tracks down faults or
suspicious activity within a network. The software integrates itself into network adapters
so that all traffic that passes through the adapter is available for viewing. Each packet
going to a computer is considered an event by a packet sniffer and can be viewed in
real time.

A firewall, as an event-logging security component, is a network tool that is
purposely built to permit or deny both incoming and outgoing traffic to and from an
organization’s network. Firewalls are developed through the usage of different Open
System Interconnection (OSI) layers. Each layer handles different pieces of
information, which allows firewalls to obtain a complete picture of information such as
applications, networks, and transports. There are various types of firewalls, which
include packet filtering firewalls, proxy firewalls, address translation firewalls, and
application layer firewalls [13]. The differences between these firewalls are that they
serve different layers of the OSI model and are suitable for specific tasks within a
network.

Firewalls follow a rule set that must be configured by a network
administrator for appropriate traffic; by default, if nothing is configured, they are in
deny mode when activated. Firewalls are typically placed at network boundaries, which
separate the private domain and public domain of an organization’s network [13]. This is
done so that all traffic must be inspected before entering or leaving the network,
mitigating the risk of malicious network attacks that can occur through bypassing
security measures [1].

Firewalls log network events into a file or a built-in event tracker for
understanding why some traffic is permitted and other traffic is not. This is helpful
for troubleshooting and creating solutions to problems that occur [13].

Anti-virus software scans files in real time as they are accessed to prevent any code
execution of infected files. Alternatively, users can schedule or run scans of an entire
computer to find inert files that may not have executed but contain malicious code. Any
file access triggers the anti-virus into its scanning mode before the file is permitted
for run-time execution. Such events are continually logged by the anti-virus, which
permits it to track files that have been verified as safe [13].

Information such as source, destination, port, encapsulated data, and header can 
be extracted from packets. The ability to view this data is valuable because admin- 
istrators can make decisions about what is going on within their network and take 
actions when necessary [2]. 
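
As a minimal sketch of this kind of extraction, the following Python code unpacks the fixed 20-byte IPv4 header and, for TCP or UDP, the two port numbers. The function name `parse_ipv4_packet` and the hand-built sample packet are assumptions made for illustration; the sketch also assumes a well-formed packet and does not verify checksums or handle IP options beyond skipping them.

```python
import struct


def parse_ipv4_packet(packet: bytes) -> dict:
    """Extract basic fields from a raw IPv4 packet (assumed well formed)."""
    (version_ihl, _tos, total_length, _ident, _frag,
     ttl, protocol, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    header_length = (version_ihl & 0x0F) * 4  # IHL is counted in 32-bit words
    info = {
        "version": version_ihl >> 4,
        "header_length": header_length,
        "total_length": total_length,
        "ttl": ttl,
        "protocol": protocol,  # 6 = TCP, 17 = UDP
        "source": ".".join(str(b) for b in src),
        "destination": ".".join(str(b) for b in dst),
        "data": packet[header_length:total_length],  # encapsulated data
    }
    # TCP and UDP both start with source port and destination port.
    if protocol in (6, 17) and len(info["data"]) >= 4:
        info["source_port"], info["destination_port"] = struct.unpack(
            "!HH", info["data"][:4])
    return info


# A hand-built UDP packet, used only for illustration.
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 1, 0, 64, 17, 0,
                     bytes([192, 168, 0, 2]), bytes([10, 0, 0, 1]))
sample += struct.pack("!HHHH", 5000, 53, 8, 0)
fields = parse_ipv4_packet(sample)
```

Here `fields` holds the source and destination addresses, the protocol number, and the ports of the sample packet, which is the kind of record an event monitor could log for each packet.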


3.2 Data Scrambling and Descrambling Technologies 
3.2.1 Cryptographic mod Operation 


In cryptography [1], we have the modulo operation mod, which is circular. The
prominent property of this modulo operation is shown below. Given $x \in \mathbb{Z}$, $x > 1$,

$$
\begin{aligned}
0 \bmod x &= 0;\\
1 \bmod x &= 1;\\
&\;\;\vdots\\
(x-1) \bmod x &= x-1;\\
x \bmod x &= 0;\\
(x+1) \bmod x &= 1.
\end{aligned}
$$

For the 26 English letters, we scramble their alphabetic order using

$$
y = (x + s) \bmod 26 \tag{3.1}
$$

where $x$ is the position of a letter in the sequence, $s$ is a constant for shifting, and
$y$ is the letter position after scrambling. For example, the letter ‘a’ is located at the
first position $x = 0$; after a shift of $s = 5$ positions using the modular operation, the
letter ‘a’ will be located at the position $y = 5$. If all the letters of a string have been
shifted in this way, the string will be scrambled or encrypted.

Fig. 3.5 Image scrambling using Arnold’s Cat Map
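
The shift in Eq. (3.1) can be sketched in a few lines of Python. The function name `shift_letters` is a hypothetical choice, and leaving non-letters unchanged is an assumption of this sketch rather than something stated in the text.

```python
def shift_letters(text, s):
    """Shift each English letter by s positions using y = (x + s) mod 26."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            x = ord(ch) - base          # position of the letter, 0..25
            y = (x + s) % 26            # Eq. (3.1)
            out.append(chr(base + y))
        else:
            out.append(ch)              # assumption: other symbols are kept as-is
    return "".join(out)
```

Descrambling uses the opposite shift, for example `shift_letters(shift_letters("surveillance", 5), -5)` gives back the original string, because Python's `%` always returns a value in 0..25 here.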

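When scrambling digital images in the spatial domain, Arnold’s Cat Map is:

$$
\begin{bmatrix} x' \\ y' \end{bmatrix} =
\begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix} \bmod N \tag{3.2}
$$

where $x, y = 0, \ldots, N-1$, and $N$ is the width and height of the image. The
scrambled image with resolution 156 × 156 is shown in Fig. 3.5.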
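
A minimal sketch of one round of Arnold's Cat Map on a square image follows. It assumes the commonly used matrix with rows (1, 1) and (1, 2), represents the image as a list of rows, and uses the hypothetical names `arnold_cat_map` and `inverse_arnold_cat_map`; the inverse uses the matrix with rows (2, -1) and (-1, 1), which is the inverse of the assumed matrix because its determinant is 1.

```python
def arnold_cat_map(image):
    """Apply one round of Arnold's Cat Map to an N x N image (list of rows)."""
    n = len(image)
    out = [[None] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            x_new = (x + y) % n          # first row of the matrix: (1, 1)
            y_new = (x + 2 * y) % n      # second row of the matrix: (1, 2)
            out[x_new][y_new] = image[x][y]
    return out


def inverse_arnold_cat_map(image):
    """Undo one round by applying the inverse matrix ((2, -1), (-1, 1)) mod N."""
    n = len(image)
    out = [[None] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            x_old = (2 * x - y) % n
            y_old = (-x + y) % n
            out[x_old][y_old] = image[x][y]
    return out
```

Because the map is a bijection on the pixel grid, only pixel positions change; applying the map several times gives stronger visual scrambling, and the same number of inverse rounds restores the image.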
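For scrambling digital images in color space we use

$$
\begin{bmatrix} r' \\ g' \\ b' \end{bmatrix} =
\begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 2 \\ 1 & 2 & 3 \end{bmatrix}
\begin{bmatrix} r \\ g \\ b \end{bmatrix} \bmod c \tag{3.3}
$$

where $r, g, b = 0, \ldots, c-1$, and $c$ is the maximum color number of the image;
usually $c = 2^n$, where $n$ is the depth of a color bitmap. An example of video
scrambling is shown in Fig. 3.6.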
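
The color-space mapping can be sketched for a single pixel as follows, assuming the matrix with rows (1, 1, 1), (1, 2, 2), (1, 2, 3) and c = 256 (8-bit channels). The names `scramble_color` and `descramble_color` are hypothetical; the inverse matrix below was derived by hand for this sketch and exists over the integers because the determinant of the matrix is 1.

```python
A = ((1, 1, 1), (1, 2, 2), (1, 2, 3))          # scrambling matrix
A_INV = ((2, -1, 0), (-1, 2, -1), (0, -1, 1))  # its integer inverse (det A = 1)


def mat_vec_mod(matrix, vec, c):
    """Multiply a 3 x 3 matrix by a 3-vector and reduce each entry mod c."""
    return tuple(sum(m * v for m, v in zip(row, vec)) % c for row in matrix)


def scramble_color(rgb, c=256):
    """Map (r, g, b) to (r', g', b') with the matrix A modulo c."""
    return mat_vec_mod(A, rgb, c)


def descramble_color(rgb, c=256):
    """Recover (r, g, b) from (r', g', b') using the inverse matrix."""
    return mat_vec_mod(A_INV, rgb, c)
```

For example, `scramble_color((10, 20, 30))` gives `(60, 110, 140)`, and `descramble_color` maps it back to `(10, 20, 30)`.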
For scrambling digital audio clips, a typical equation is shown in Eq. (3.4):

$$
T = \sum_{k=1}^{n} a_k T_m^k \tag{3.4}
$$

where $T_m^k = \{t_{ij}\}_{n \times n}$ and

$$
t_{ij} = \begin{cases} 1, & j = (m - i + k) \bmod n \\ 0, & \text{otherwise.} \end{cases}
$$


Fig. 3.6 Scrambling in spatial and color domain


Fig. 3.7 Scrambling in spatial domain


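Suppose we have a vector $V = (v_1, v_2, \ldots, v_n)^{\top}$ and an identity matrix
$I = \mathrm{diag}(\alpha_1, \alpha_2, \ldots, \alpha_n)$, $\alpha_i = 1$, $i = 1, \ldots, n$;
apparently, $V' = V = I \cdot V = V \cdot I$. Now if we swap the two rows $i$ and $j$
($i < j$) of the matrix $I$, we have

$$
I' = \begin{bmatrix}
1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots & & \vdots \\
0 & \cdots & 0 & \cdots & 1 & \cdots & 0 \\
\vdots & & \vdots & \ddots & \vdots & & \vdots \\
0 & \cdots & 1 & \cdots & 0 & \cdots & 0 \\
\vdots & & \vdots & & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \cdots & 0 & \cdots & 1
\end{bmatrix} \tag{3.5}
$$

Correspondingly, we have
$V' = (v_1, v_2, \ldots, v_{i-1}, v_j, v_{i+1}, \ldots, v_{j-1}, v_i, v_{j+1}, \ldots, v_n)$
after $V' = I' \cdot V$; hence the vector $V$ has been scrambled to the vector $V'$. Without
loss of generality, we assume all the audio clips could be segmented into the vectors
$V$, $n = |V| < \infty$.

Of course, the equation $V' = U \cdot I' \cdot V$ could be generalized for scrambling digital
images or videos, with $U = (\alpha_1, \alpha_2, \ldots, \alpha_n)$. If we properly set the row
numbers $i$, $j$ and different values for $\alpha_k \neq 0$, $\alpha_k \in \mathbb{Z}$,
$k = 1, \ldots, n$, the cipher space will be much bigger. An example of video scrambling is
shown in Fig. 3.7.

Fig. 3.8 Progressive image scrambling at multiple levels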


Progressive scrambling of digital images is defined as, 


1 
lim So (Mı, M2) = lim — Amin (2) = 0 3.6 
dim Sia, M2) inoa 2, min(@) = € > (3.6) 
Amin (2) = min({]2 — 2'|, |2| = |2"|, V2! C M), 2CM 6D 


where $S_{|\Omega|}(M_1, M_2)$ is called the scrambling distance between a given media
$M_1$ and its scrambled one $M_2$ under the resolution $|\Omega|$ [22].
We decompose the media in multiple resolutions, 


$$
D_l(m) = D_l(E)(m) = D_l((C_1, C_2, \ldots, C_n))(m) \tag{3.8}
$$


where $D_l(m)$, $l = 1, 2, \ldots, L$ is a decomposition of the media $m$ at level $l$, and
$C_i$ are the coefficients after the decomposition. The components
$E = (C_1, C_2, \ldots, C_n)$, $i = 1, 2, \ldots, n$ and the given random keys will be operated
together using the operation $\otimes$. The operation is usually chosen for the sake of its
simplicity of description and implementation.


$$
E'_l(m) = E_l(m) \otimes K_l \tag{3.9}
$$

$$
D'_l(m) = D_l(E'_l)(m) \tag{3.10}
$$

$$
E_l(m) \otimes K_l = (C_{1,l}, C_{2,l}, \ldots, C_{n,l}) \otimes K_l
= (C_{1,l} \otimes K_l, C_{2,l} \otimes K_l, \ldots, C_{n,l} \otimes K_l)(m) \tag{3.11}
$$


Therefore, we define a tensor product $\otimes$; it is distributive and operates on each
element iteratively, as shown in Fig. 3.8.

For example, the zigzag ordering of DCT components in JPEG could be defined 
as a tensor product and applied to 8 x 8 blocks of images for multiple resolutions 
hierarchically and iteratively in spatial or frequency domain as shown in Fig. 3.9.


Fig. 3.9 Image progressive scrambling using zigzag reordering. a the original image and b the
scrambled image


Mathematically, the zigzag ordering could be implemented by using an elementary
matrix. Suppose the coefficients $c_i$ and $c_j$ ($i \neq j$) will be swapped in zigzag
ordering; the $i$-th and $j$-th rows of the identity matrix are swapped
so as to implement this operation. The descrambling uses exactly the
same matrix, as

$$
K \cdot K = I, \quad K = K^{\top} = K^{-1} \tag{3.12}
$$

$$
(c_1, c_2, \ldots, c_j, \ldots, c_i, \ldots, c_n) = (c_1, c_2, \ldots, c_i, \ldots, c_j, \ldots, c_n)\, K \tag{3.13}
$$

$$
K = \begin{bmatrix}
1 & & & & & & \\
 & \ddots & & & & & \\
 & & 0 & \cdots & 1 & & \\
 & & \vdots & \ddots & \vdots & & \\
 & & 1 & \cdots & 0 & & \\
 & & & & & \ddots & \\
 & & & & & & 1
\end{bmatrix} \tag{3.14}
$$

where $K$ is regarded as the key for progressive scrambling. For each 8 × 8 block, we
set different types of keys. This re-ordering-based scrambling can also be
applied to frequency domains such as the Discrete Cosine Transform (DCT) and
the Discrete Wavelet Transform (DWT).


Fig. 3.10 Progressive scrambled images (512 x 512) in DCT domain for the images Lena, Mona
(Level 1, Level 2, Level 3)


In the spatial domain, we segment an image into 8 × 8 pixel-size blocks and scramble
them using zigzag ordering. On the basis of the first round of scrambling, we regard
each 8 × 8 pixel-size block as a unit and scramble 8 × 8 such blocks as the second
level of scrambling. After all the blocks have been scrambled in this hierarchical way, the
entire image has been scrambled. The descrambling is exactly the reverse procedure
of this progressive scrambling. The results are shown in Fig. 3.10.
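
One level of this zigzag re-ordering can be sketched as below. The sketch assumes the JPEG-style zigzag scan and writes the scanned values back in row-major order; the function names are hypothetical, and the per-block keys and the second, block-level round described above are left out for brevity.

```python
def zigzag_order(n=8):
    """Return the (row, col) pairs of an n x n block in JPEG-style zigzag order."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Cells are grouped by anti-diagonal r + c; odd diagonals run downward
        # (row increasing), even diagonals run upward (column increasing).
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )


def zigzag_scramble(block):
    """Read the block along the zigzag path and write it back row by row."""
    n = len(block)
    seq = [block[r][c] for r, c in zigzag_order(n)]
    return [seq[i * n:(i + 1) * n] for i in range(n)]


def zigzag_descramble(block):
    """Reverse procedure: put the row-major values back onto the zigzag path."""
    n = len(block)
    seq = [v for row in block for v in row]
    out = [[None] * n for _ in range(n)]
    for v, (r, c) in zip(seq, zigzag_order(n)):
        out[r][c] = v
    return out
```

The same two functions could be reused at the second level by treating each 8 × 8 block as one element of a coarser 8 × 8 grid.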


3.2.2 Cryptographic xor Operation 


In cryptography [1], another crucial operator is xor. Since xor is the exclusive-or
operation in logic, we have

1 xor 1 = 0;
0 xor 0 = 0;
1 xor 0 = 1;
0 xor 1 = 1.

Thus, the salient property of this operation is: if a xor b = c, then a xor c = b
and c xor b = a, where a, b, c are binary numbers
$(\beta_0, \beta_1, \ldots, \beta_n)_2 = \sum_{i=0}^{n} \beta_i \cdot 2^i$,
$\beta_i = 0, 1$; $i = 0, \ldots, n$.
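
This property is what makes xor usable for both scrambling and descrambling with the same key. A minimal sketch, in which the function name `xor_bytes` and the repeating-key scheme are assumptions made for illustration:

```python
def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Xor every byte of data with the key, repeating the key as needed."""
    return bytes(d ^ key[i % len(key)] for i, d in enumerate(data))


message = b"surveillance"
key = b"\x5a\xc3"                       # illustrative key only
scrambled = xor_bytes(message, key)     # c = a xor b
recovered = xor_bytes(scrambled, key)   # c xor b = a
```

Applying the same function twice with the same key returns the original bytes, so `recovered` equals `message`.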

Usually, an encryption algorithm consists of a set K of keys, a set M of messages, and
a set C of ciphertexts (encrypted messages). Alice is the message sender A, Bob is the
message receiver B, and Eva is the assumed eavesdropper. Symmetric encryption
adopts the same key k for both encryption E(k) and decryption D(k). For
example, DES (Data Encryption Standard) is a symmetric block-encryption algorithm that
encrypts a block of data at a time using the Feistel function (expansion, key mixing,
substitution, and permutation). Since it is not secure, triple-DES (3DES) has been
employed and is considered more secure. Similar encryption algorithms include
AES (Advanced Encryption Standard) [15,17,19].

RC4 (Rivest Cipher 4), designed by Ron Rivest in 1987, was a widely used symmetric
stream cipher, but it is known to have vulnerabilities. It encrypts/decrypts a stream of
bytes (e.g., wireless transmission). The key is an input to a pseudorandom-bit generator
which could generate an infinite key stream [15, 17,19].
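
For illustration only, the key-stream idea can be sketched with the publicly known RC4 algorithm (key scheduling followed by pseudorandom generation). The function name `rc4` is a choice made here, and given the known vulnerabilities this sketch is meant for study, not for protecting real data.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt data with RC4; the operation is its own inverse."""
    # Key-scheduling algorithm: permute the state S according to the key.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudorandom generation: xor each data byte with one key-stream byte.
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) % 256])
    return bytes(out)
```

Because the key stream is combined with the data by xor, calling `rc4(key, rc4(key, data))` returns `data`.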

Asymmetric encryption is based on two keys: the published public key is used to encrypt
data, while the private key is used to decrypt the data encrypted with the corresponding
public key. Bob (B) transmits his public key $(n, e)$ to Alice (A) so that she can encrypt
the plaintext $M$,

$$
C = M^{e} \pmod{n}
$$

When Bob receives the encrypted message $C$, he uses the private key $d$ to get the
message,

$$
M = C^{d} \pmod{n}
$$

For example, the public key is $(n = 3233, e = 17)$,

$$
C = M^{17} \pmod{3233}
$$

The private key is $d = 2753$,

$$
M = C^{2753} \pmod{3233}
$$

If the message $M = 65$, then

$$
C = 65^{17} \pmod{3233} = 2790
$$

To decrypt,

$$
M = 2790^{2753} \pmod{3233} = 65
$$

The correctness of this scheme rests on the relation

$$
d \cdot e \equiv 1 \pmod{\phi(n)}
$$

which can be shown with the Chinese remainder theorem (CRT), where

$$
\phi(n) = (p - 1) \cdot (q - 1)
$$

and

$$
n = p \cdot q
$$

Its security relies on the difficulty of factoring $n$ into $p$ and $q$.

Note: $p$ and $q$ are two very big prime numbers [15,17,19].
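
The worked example can be checked with Python's built-in modular exponentiation. The primes p = 61 and q = 53 are not given in the text; they are used here on the assumption that they generated the example, since 61 × 53 = 3233. The modular inverse via `pow(e, -1, phi)` requires Python 3.8 or later.

```python
p, q = 61, 53                 # assumed primes; 61 * 53 = 3233
n = p * q                     # public modulus
phi = (p - 1) * (q - 1)       # phi(n) = (p - 1)(q - 1) = 3120
e = 17                        # public exponent
d = pow(e, -1, phi)           # private exponent, d * e = 1 (mod phi(n))


def encrypt(message: int) -> int:
    """C = M^e mod n."""
    return pow(message, e, n)


def decrypt(ciphertext: int) -> int:
    """M = C^d mod n."""
    return pow(ciphertext, d, n)


cipher = encrypt(65)
plain = decrypt(cipher)
```

With these values `d` evaluates to 2753, `cipher` to 2790, and `plain` to 65, matching the numbers above. Such small primes are for demonstration only.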
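Formally, the Chinese remainder theorem (CRT) is described as:

If $n_i \in \mathbb{Z}^{+}$, $n_i > 1$, $i = 1, 2, \ldots, k$ are pairwise co-prime and
$a_i \in \mathbb{Z}$, $0 \leq a_i < n_i$, then there exists an integer $r$ satisfying the
congruences $r \equiv a_i \pmod{n_i}$ $(i = 1, 2, \ldots, k)$:

$$
r = \sum_{i=1}^{k} a_i M_i N_i \bmod N \tag{3.15}
$$

where $N = \prod_{i=1}^{k} n_i$, $N_i = N / n_i$, and $N_i M_i \equiv 1 \pmod{n_i}$.

For example, $k = 3$, $n_1 = 7$, $n_2 = 11$, $n_3 = 13$, $N = 1001$;
$N_1 = 143$, $N_2 = 91$, $N_3 = 77$; $M_1 = 5$, $M_2 = 4$, $M_3 = 12$. If
$r \equiv 5 \pmod{7}$, $r \equiv 3 \pmod{11}$, $r \equiv 10 \pmod{13}$, namely,
$a_1 = 5$, $a_2 = 3$, $a_3 = 10$, then
$r = (5 \times 143 \times 5 + 3 \times 91 \times 4 + 10 \times 77 \times 12) \bmod 1001 = 894$.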
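
Equation (3.15) translates almost directly into code. In this sketch the function name `crt` is hypothetical, the moduli are assumed to be pairwise co-prime, and `math.prod` together with `pow(x, -1, m)` requires Python 3.8 or later.

```python
from math import prod


def crt(residues, moduli):
    """Solve r = a_i (mod n_i) for pairwise co-prime moduli, as in Eq. (3.15)."""
    N = prod(moduli)
    r = 0
    for a_i, n_i in zip(residues, moduli):
        N_i = N // n_i                 # N_i = N / n_i
        M_i = pow(N_i, -1, n_i)        # N_i * M_i = 1 (mod n_i)
        r += a_i * M_i * N_i
    return r % N


r = crt([5, 3, 10], [7, 11, 13])
```

For the example above, `r` is 894, and indeed 894 leaves the remainders 5, 3, and 10 when divided by 7, 11, and 13.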

For digital images and videos, we could apply the xor operation to pixel values, pixel
blocks, and video frames with the given keys so that the digital media could be
scrambled. However, only using xor in color space for visual scrambling is not
sufficient because we may perceive the object shapes, edges, or contours on the
scrambled images.

Compared to the existing cryptographic or scrambling techniques based on
plain text, such as substitution, transposition, DES, 3DES, AES, RSA, and ECC, the
image scrambling approaches have a much wider encryption space instead of only
taking the ASCII codes into consideration. On the other hand, the cryptographic
techniques may fail to scramble a digital image because
the coherence existing between the rows and columns or blocks of a digital image
could not be broken [6, 16].

Hilbert space-filling curves (HSC), shown in Fig. 3.11, have been used to fill up the image
space at various resolutions because of their well-known features such as self-similarity
and multiple resolutions. Once the pixels are reordered along one of these curves, the
digital image is scrambled in the spatial domain; this has been applied to encrypt
digital TV signals. Such scrambling could further be combined with techniques
from the frequency domain and the compression domain of digital images.

The HSC has a fractal structure generated by a recursive production rule;
it has the property of self-similarity and satisfies an iterated function system (IFS), and
its dimension is a fraction supported by the fixed-point theory. After several rounds of
recursion started from the fractal generator, the curve will fill up a given space recursively.
There are multiple fractal rules to generate HSC curves. In the scrambled image,
only the pixel locations have been reordered; the pixel color values are unchanged.

Fig. 3.11 Three rounds of Hilbert curves

For each block, we recursively generate a corresponding HSC curve using the
generator. The curve starts from the generator shown in red, whose opening points
upward. In the second iteration, at each turning point, including the start and end
points, we generate the same shape; however, its size and orientation change.
At the starting point, we rotate the generator 90° to the left (anti-clockwise);
at the end point, we rotate the generator 90° to the right (clockwise); at the other two
turning points, the generator keeps the same orientation. The procedure is
described by Eq. (3.16),


$$p(x, y) = \mathrm{HSC}(p(x, y)) \tag{3.16}$$


where HSC(·) is the iterative function and p(x, y) is a turning point on the curve. The
recursion stops when the final resolution is reached so that the given space is fully
filled. The Hausdorff dimension of this fractal curve is 2.00.

The image scrambling operation is described as,


$$I'(x, y) = \mathrm{HSC}(I(x, y)) \bmod W \tag{3.17}$$


where I is the original image without scrambling, I′ is the scrambled image, and W
is the image width. After generating the HSC curve, we reorder the pixels according to the
pixel sequence along the curve, as in Eq. (3.17). Equation (3.17) first converts the
2D points into a 1D sequence; the 1D sequence is then used to fill the image
space line by line from top to bottom. Consequently, the image is fully scrambled and
is exported as the encrypted image.
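
The two steps of Eq. (3.17), reading pixels along the curve and refilling the image line by line, can be sketched as follows. The index-to-coordinate mapping is the widely used iterative Hilbert construction rather than the generator-rotation rule described above, and it assumes a square grayscale image whose side is a power of two; all function names are ours.

```python
def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n x n grid (n a power of 2)."""
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:  # rotate or flip the current quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Pixel coordinates in the order in which the curve visits them."""
    return [hilbert_d2xy(n, d) for d in range(n * n)]


def scramble(img):
    """Read pixels along the curve (2D to 1D), then refill the image row by row."""
    n = len(img)
    flat = [img[y][x] for x, y in hilbert_order(n)]
    return [flat[r * n:(r + 1) * n] for r in range(n)]


def descramble(scr):
    """Inverse operation: put the row-by-row sequence back along the curve."""
    n = len(scr)
    flat = [v for row in scr for v in row]
    img = [[None] * n for _ in range(n)]
    for v, (x, y) in zip(flat, hilbert_order(n)):
        img[y][x] = v
    return img


image = [[r * 8 + c for c in range(8)] for r in range(8)]
print(descramble(scramble(image)) == image)  # True
```

Only positions are permuted, so the multiset of pixel values (and hence the histogram) is the same before and after scrambling.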

The encryption and decryption algorithms using HSC-based image scrambling in the
DCT domain change the pixel sequence spatially but do not alter the color values of
image pixels. Therefore, the visual quality of the decrypted image is not affected.

Figure 3.12 shows results of HSC-based image scrambling in the DCT domain with
different block resolutions. We typically generate an HSC curve recursively, and then the
DCT coefficients (DC and AC) within the blocks are reordered along it. The inverse DCT
(IDCT) is used to transform the scrambled coefficients back to the spatial domain. Since
DCT and IDCT are based on an orthogonal decomposition of the visual signals of
block-based images, this decomposition ensures that after encryption and decryption,
the original image can be reconstructed block by block.

Fig. 3.12 Image encryption for Lena (512 × 512) in DCT domain using HSC scrambling for
blocks of different sizes: (a) block size 16 × 16, (b) block size 32 × 32, (c) block size 64 × 64

The security of this HSC encryption is ensured by the keys and the scrambling
algorithm. The HSC generator has multiple choices: it can be rotated clockwise or
anti-clockwise, giving four orientations. Different generators produce completely
different HSC curves. Meanwhile, the image block can be chosen at various
resolutions, and scrambling based on different block sizes leads to image encryption of
different strengths: the larger the block size, the stronger the encryption. Therefore,
which HSC has been selected, and at which resolution, forms the secret key of the
encryption algorithm [23].

HSC-based image scrambling is different from traditional encryption algorithms
such as RSA, ECC, and secret sharing [15,21]. The reason is that the scrambling
completely destroys the order of pixel locations in the image; therefore,
pixel operations such as edge extraction, SIFT, and others are no longer possible
(especially in the DCT domain) [11].


3.2.3 Image Encryption on Frequency Domain 


Image encryption techniques can be divided into two groups: the first encrypts
images in the spatial domain; the other encrypts images in the frequency domain [14].
Most image encryption schemes focus on the frequency domain; various methods
based on the Fourier transform, discrete cosine transform (DCT), and discrete wavelet
transform (DWT) have been widely applied to image encryption. DCT avoids the
complex-valued calculation of the traditional DFT, and DWT captures local
properties of the input image in both the spatial and frequency domains, which makes
both convenient for image encryption [8].

The discrete cosine transform (DCT) has been widely used in image encryption. In 2010,
a color image encryption algorithm was proposed based on the Arnold transformation.
DCT was chosen because the pixel values of a given image are real-valued
and the matrix remains real-valued after the transformation. In addition,
a partition operation was applied before the Arnold transformation so that the
operation could be carried out cleanly. As a result, the security of the encrypted images
was improved.

In 2013, a multi-image encryption algorithm was proposed based on the cascaded
fractional Fourier transform. In this method, the input images are successively encrypted
using a series of encryption keys until a final encrypted image is obtained. The
algorithm not only encrypts multiple images but is also reported to be very secure.
Because there are many secret keys, the algorithm can be applied to multi-user
authentication.

In 2000, a generalized image encryption algorithm based on fractional Fourier 
transform was proposed, and this method used a new generalized fractional Fourier 
transform instead of random phase mask. In this algorithm, the period of fractional 
Fourier transform is extended to any integer; thus, the period and transformation 
index are regarded as two secret keys. 


3.3 Data Secure Transmission over Internet and Networks 


In this section, we introduce how HTTP servers and FTP servers broadcast
captured surveillance video and audio around the world; no matter
where we are, we can receive the audio and video, as with the well-known YouTube website.
If cloud servers are used, the surveillance images, video, and audio
clips could be pushed to our accounts.

A typical FTP client is WinSCP, shown in Fig. 3.13. If we have big surveillance
data, we may upload them to a Unix, Linux, or Microsoft Windows server;
traditionally, FTP software is designed for this work. Microsoft Windows now also
provides network services, so we could upload the surveillance media to those
server machines.

A typical HTTP server for uploading and playing videos on Microsoft
Windows is Apache, although the operating system (OS) has its own IIS services [18].
The EasyPHP software integrates Apache and MySQL for database management
using the PHP programming language, which provides great convenience for Internet
users. Users could link videos to such a website and use the available players to play
the videos, as on the YouTube website (Fig. 3.14).

Hypertext transfer protocol secure (HTTPS) is an extension of the hypertext
transfer protocol (HTTP) for secure communication over a computer network. In HTTPS,
the communication protocol is encrypted using transport layer security (TLS) or,
formerly, secure sockets layer (SSL). Plain HTTP is not encrypted, which can let attackers
gain access to website accounts and sensitive information, whereas HTTPS is considered
secure. An HTTPS deployment requires a public key certificate for the web server. For
client authentication, the site administrator typically creates a certificate for each user;
the certificate is automatically checked by the server on each reconnect to verify the
user's identity, potentially without even entering a password.

Fig. 3.14 The software EasyPHP is running as the HTTP server

Fig. 3.15 Interface of the software Active WebCam for video broadcasting

Media streaming software includes Active WebCam, Microsoft Expression
Encoder, and Adobe Media Encoder; such software can create a streaming server
and transmit streaming monitoring data via the Internet. At the receiver side, a web
browser is sufficient; these are one-to-many broadcasting systems.

Active WebCam, shown in Fig. 3.15, broadcasts live MPEG-4 video
streams and MP3 audio signals at up to 30 frames per second. The software is able
to:


• Monitor our home or office while we are away
• Support encrypted transmissions
• Send email and SMS when motion is detected.


3.4 Questions 


Question 1. What are the functionalities of each layer of a computer network? Which
one is directly related to end users?


Question 2. How do we set up a laptop or mobile phone as a hotspot?


Question 3. How do we set up a multimedia streaming server?


Question 4. What are the typical attacks on a computer network? What hardware
and software are used to counter these attacks? What is a VPN? Can you give an example?


Question 5. Which algorithms are often employed for image and video scrambling
and descrambling?


Surveillance Data Analytics


4.1 Object Computable Features 


In digital image and video processing, we usually start from a single pixel of a raster
display. Pixel color is represented by bits, each a binary value ‘0’ or ‘1’;
usually, we use three channels, red (R), green (G), and blue (B), to display one pixel, and
each independent channel has 8 bits or 1 byte. Sometimes, we also need the alpha
channel for transparent or opaque display, which occupies another byte. Therefore,
in total we use at least 4 bytes or 32 bits to display one real-color pixel in general.


• Image Color

Image colors comprise binary color, pseudocolor, grayscale, and real
color. Binary color uses only ‘0’ or ‘1’ to present the colors of a binary image,
namely white and black. Pseudocolor presents a pixel color using an index number,
since the display cache could not hold too many colors simultaneously at the early stage
of display technology. Grayscale refers to the three RGB channels having the same
intensity, namely $I = R = G = B$, $I = 0, 1, 2, \ldots, 255$; simply,


$$I = \left\lfloor \frac{R + G + B}{3} \right\rfloor \tag{4.1}$$

where $\lfloor \cdot \rfloor$ is the floor function. More commonly, to convert a color image to a
grayscale image, we use Eq. (4.2),


$$I = \lfloor \alpha \cdot R + \beta \cdot G + \gamma \cdot B \rfloor \tag{4.2}$$


where $\alpha + \beta + \gamma = 1.0$ and $\alpha, \beta, \gamma \in (0, 1.0)$; for example, in terms of CIE 1931,
the linear luminance $I$ is given by $\alpha = 0.2126$, $\beta = 0.7152$, and $\gamma = 0.0722$.
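
Equations (4.1) and (4.2) translate directly into code. The sketch below assumes 8-bit channel values; the function names are ours, and the default weights are the CIE 1931 values quoted above.

```python
from math import floor


def gray_average(r, g, b):
    """Eq. (4.1): floor of the mean of the three channels."""
    return (r + g + b) // 3


def gray_weighted(r, g, b, alpha=0.2126, beta=0.7152, gamma=0.0722):
    """Eq. (4.2): floor of the weighted sum; the weights are assumed to sum to 1."""
    return floor(alpha * r + beta * g + gamma * b)


for name, rgb in [("red", (255, 0, 0)), ("green", (0, 255, 0)), ("blue", (0, 0, 255))]:
    print(name, gray_average(*rgb), gray_weighted(*rgb))
```

Pure red, green, and blue all map to 85 under Eq. (4.1), whereas Eq. (4.2) gives 54, 182, and 18, reflecting the different contributions of the channels to luminance.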

Real color means that so many colors are shown on a display that our eyes cannot
distinguish the colors of the real world from those on the display.
At present, many consumer products offer retina displays in terms of both pixel size and
colors. Displaying more colors means we have to use a larger buffer to hold all colors
of one image or one frame of a video, and all the pixel colors should appear
on the screen simultaneously; that is, in the display there is no time delay between
the first pixel being lit and the last one.

• Color Spaces: RGB, YCbCr, and YUV

RGB is one of the fundamental color schemes in raster display; its three channels
are closely and linearly correlated. Usually, we map colors from one space to
another, such as YCbCr, YUV, HSV, etc.; the colors in different spaces
show distinct properties. After color space conversion, the color distribution
and the linear relationship between the color channels are more reasonable. The
color conversion between different spaces is one-to-one, like Eq. (2.11) in image
compression. For example, the conversion equations between the RGB and Y′UV color
spaces are shown in Eqs. (4.3) and (4.4),


$$\begin{bmatrix} Y' \\ U \\ V \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ -0.14713 & -0.28886 & 0.436 \\ 0.615 & -0.51499 & -0.10001 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} \tag{4.3}$$

$$\begin{bmatrix} R \\ G \\ B \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1.13983 \\ 1 & -0.39465 & -0.58060 \\ 1 & 2.03211 & 0 \end{bmatrix} \begin{bmatrix} Y' \\ U \\ V \end{bmatrix} \tag{4.4}$$
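
As a quick check that Eqs. (4.3) and (4.4) are inverses of each other up to rounding of the coefficients, the following sketch converts a pixel to Y′UV and back. The helper `mat_vec` and the sample pixel are illustrative assumptions.

```python
RGB_TO_YUV = [
    [0.299, 0.587, 0.114],
    [-0.14713, -0.28886, 0.436],
    [0.615, -0.51499, -0.10001],
]
YUV_TO_RGB = [
    [1.0, 0.0, 1.13983],
    [1.0, -0.39465, -0.58060],
    [1.0, 2.03211, 0.0],
]


def mat_vec(matrix, vec):
    """Multiply a 3 x 3 matrix by a 3-vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


rgb = [200, 120, 40]
yuv = mat_vec(RGB_TO_YUV, rgb)       # Eq. (4.3)
rgb_back = mat_vec(YUV_TO_RGB, yuv)  # Eq. (4.4)
print([round(c, 2) for c in rgb_back])
```

Because the published coefficients are rounded to five decimals, the round trip is exact only to within a small fraction of one intensity level.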


• Histogram

A histogram is a statistical concept that counts how many pixels fall into each
intensity bin; the number of bins is usually 8, 16, 64, or 256. For an RGB image,
we compute the statistics in each channel separately. The result shows the
distribution of pixel intensities of one channel in an image as a bar diagram; the concept
of “entropy” is derived from the histogram as Eq. (4.5), which shows the information
capacity of the image [12].

For the same scene with a slight change, the color distribution of each image
should not change much; therefore, the histograms do not differ much.
The histogram is regarded as one of the very robust statistical features
in image search and retrieval or object tracking [9].


$$E(I) = -\sum_{i=1}^{b} h_i \cdot \ln h_i \tag{4.5}$$


where b is the number of bins of the image histogram (Fig. 4.1).
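
A minimal sketch of the histogram and of Eq. (4.5) is given below. It assumes that $h_i$ is the normalized frequency of bin $i$ (so the $h_i$ sum to 1) and that empty bins contribute nothing; both are our reading, since the text does not state them.

```python
from math import log


def histogram(pixels, bins=8, levels=256):
    """Count how many pixels fall into each of `bins` equal-width intensity bins."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // levels] += 1
    return counts


def entropy(counts):
    """Eq. (4.5) on normalized bin frequencies; empty bins are skipped."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(h * log(h) for h in probs)


pixels = [0, 31, 32, 255]
print(histogram(pixels))           # [2, 1, 0, 0, 0, 0, 0, 1]
print(entropy(histogram(pixels)))
```

A histogram concentrated in a single bin has entropy 0, and a uniform histogram over b bins reaches the maximum, ln b.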

Fig. 4.1 Histogram of the grayscale image Lena (512 × 512)

If an image or some parts of it are too dark or too bright, we are able to recover
the “hidden” visual information using histogram equalization. Suppose the least
bin number is $b_{\min}$ and the greatest bin number is $b_{\max}$; all the bin numbers $b_i$
between $b_{\min}$ and $b_{\max}$ are hence recalculated:


$$b' = \frac{b_i - b_{\min}}{b_{\max} - b_{\min}} \cdot B \tag{4.6}$$


$$h(b') = h(b_i) \tag{4.7}$$


Equations (4.6) and (4.7) show that the pixel colors that lie only within a specific
interval are stretched to the full range from 0 to B.
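
Equation (4.6) amounts to a linear stretch of the occupied intensity interval. The sketch below applies it to pixel intensities directly, assuming one bin per intensity level and integer (floor) rounding; the name `stretch` and the default B = 255 are our choices.

```python
def stretch(pixels, top=255):
    """Map [min, max] of the input linearly onto [0, top], following Eq. (4.6)."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:  # flat image: there is nothing to stretch
        return list(pixels)
    return [(p - lo) * top // (hi - lo) for p in pixels]


print(stretch([100, 110, 120, 150]))  # [0, 51, 102, 255]
```

The ordering of intensities is preserved, so each bin keeps its count as stated by Eq. (4.7); only its position along the intensity axis changes.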

For histograms $Q = (H_1^Q, H_2^Q, \ldots, H_b^Q)$ and $D = (H_1^D, H_2^D, \ldots, H_b^D)$, the inner
product or dot product is,


$$\langle Q, D \rangle = Q \cdot D = |Q| \cdot |D| \cos(\alpha) = \sum_{i=1}^{b} (H_i^Q \cdot H_i^D) \tag{4.8}$$

$$\cos(\alpha) = \frac{\langle Q, D \rangle}{|Q| \, |D|} = \frac{Q \cdot D}{|Q| \, |D|} = \frac{\sum_{i=1}^{b} (H_i^Q \cdot H_i^D)}{\sqrt{\sum_{i=1}^{b} (H_i^Q)^2} \, \sqrt{\sum_{i=1}^{b} (H_i^D)^2}} \tag{4.9}$$


The histogram distance is,


$$\Delta H(Q, D) = \sum_{i=1}^{b} |H_i^Q - H_i^D| \tag{4.10}$$


The normalized distance is,


$$\Delta H(Q, D) = \sum_{i=1}^{b} \frac{|H_i^Q - H_i^D|}{\max(H_i^Q, H_i^D)} \tag{4.11}$$


Fig. 4.2 An example of two images with the same histogram distributions


For two images having different contents but with the same histogram as shown in 
Fig. 4.2a, b, we need to utilize further fine-grained approaches to find the differences 
between them. 
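
The histogram comparisons of Eqs. (4.9) to (4.11) can be sketched as below. Skipping bins that are empty in both histograms in the normalized distance is our assumption, made to avoid a division by zero that the text does not address.

```python
from math import sqrt


def cosine(q, d):
    """Eq. (4.9): cosine of the angle between two histograms."""
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (sqrt(sum(a * a for a in q)) * sqrt(sum(b * b for b in d)))


def histogram_distance(q, d):
    """Eq. (4.10): sum of absolute bin differences."""
    return sum(abs(a - b) for a, b in zip(q, d))


def normalized_distance(q, d):
    """Eq. (4.11); bins empty in both histograms are skipped (an assumption)."""
    return sum(abs(a - b) / max(a, b) for a, b in zip(q, d) if max(a, b) > 0)


Q = [4, 2, 0, 2]
D = [2, 2, 0, 4]
print(cosine(Q, D), histogram_distance(Q, D), normalized_distance(Q, D))
```

All three measures depend only on bin counts, so two images with identical histograms, as in Fig. 4.2, give a cosine of 1 and distances of 0 regardless of their content.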

• LBP

The local binary pattern (LBP) also examines the spatial relationship of pixels for
texture analysis [71]: the adjacent pixels are converted to binary codes by using the
grayscale value of the center pixel as a threshold. Accordingly, the LBP histogram is
adopted as a feature descriptor of texture [79].

In MATLAB, the LBP feature vector is returned as a 1 × n vector, where the length n is
the number of features and depends on the number of cells in the image. The
function partitions an input image into nonoverlapping cells. When the cell size is
increased, local details are lost.
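
The thresholding step can be sketched for the basic 3 × 3 neighborhood as follows. The clockwise bit order, the use of "greater than or equal" as the comparison, and the single whole-image histogram (no cells) are assumptions for illustration; toolbox implementations may differ in all three.

```python
# Clockwise neighbor offsets (dy, dx), starting from the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, y, x):
    """8-bit code: bit k is set when neighbor k is at least as bright as the center."""
    center = img[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of the LBP codes of all interior pixels."""
    hist = [0] * 256
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[lbp_code(img, y, x)] += 1
    return hist


patch = [[9, 9, 9],
         [1, 5, 1],
         [1, 1, 1]]
print(lbp_code(patch, 1, 1))  # 7: only the three top neighbors reach the threshold
```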

• Texture

An image usually has texture from clothes, curtains, carpets, etc. Texture analysis of an
image refers to characterizing the regions of the image by their texture content [65,91].
Texture analysis quantifies intuitive qualities described by terms such as rough,
smooth, silky, or bumpy as a function of the spatial variation in pixel intensities. The
gray-level co-occurrence matrix (GLCM) considers the spatial relationship of pixels
in examining texture. The GLCM functions characterize the texture of an image
by calculating how often pairs of pixels with specific values and in a specified spatial
relationship occur in the image. An example of GLCM-based flame detection is
shown in Fig. 4.3.

Fig. 4.3 An example of flame detection


In Fig. 4.3, we first use a model based on histogram and color to depict
the candidate flame regions. A flame region is a group of pixels (blob) with high
intensity values. After the training process, we segment such regions by
using a threshold with the assistance of the color histogram. Meanwhile, we also observe
that a flame region differs from other regions [41]. The most significant difference is the
color feature. Although using color alone causes many false detections, it is sufficient to
capture all kinds of flames. We can then set other constraints on the flame regions so as
to remove the wrongly detected ones. In particular, the inner part of a flame is bright and
close to white, while its verge has various colors. This feature is a powerful rule that can
improve the detection performance dramatically.

Secondly, we deploy Gabor texture, Tamura texture, and GLCM texture to verify
the detected regions by using texture analysis. The texture details
are defined for different types of flames because the color-based step may
confuse flames with targets that have similar colors or with other kinds of lighting
conditions.

On the basis of Fig. 4.3, the method is flexible and robust and can be
applied to almost any condition as long as the camera is fixed and stabilized. The
method is reliable in complex conditions and is fairly adaptive at detecting
fire-colored objects. Accordingly, it can be applied to a
multitude of environments for flame detection.

In MATLAB, GLCM texture features are presented by:


• Contrast. A measure of the intensity contrast between a pixel and its neighbor over the
whole image, which calculates the local variations in the gray-level co-occurrence
matrix

$$\sum_{i,j} |i - j|^2 \, I(i, j) \tag{4.12}$$


• Correlation. A measure of how a pixel is correlated to its neighbor over the whole
image, which shows the joint probability of occurrence of the specified pixel pairs.

$$\sum_{i,j} \frac{(i - \mu_i)(j - \mu_j) \, I(i, j)}{\sigma_i \sigma_j} \tag{4.13}$$


where $\mu_i$ and $\mu_j$ are means, $\sigma_i$ and $\sigma_j$ are standard deviations.

• Energy. The sum of squared elements in the GLCM,

$$\sum_{i,j} I(i, j)^2 \tag{4.14}$$


• Homogeneity. A measure of how close the distribution of elements in the GLCM is
to its diagonal

$$\sum_{i,j} \frac{I(i, j)}{1 + |i - j|} \tag{4.15}$$
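
The four statistics of Eqs. (4.12) to (4.15) can be sketched in plain Python. The sketch assumes a GLCM built from horizontally adjacent pixel pairs and normalized so that its entries sum to 1; the function names and the small 4-level test image are illustrative only.

```python
from math import sqrt


def glcm(img, levels):
    """Normalized co-occurrence matrix for the pair (pixel, right-hand neighbor)."""
    counts = [[0] * levels for _ in range(levels)]
    for row in img:
        for a, b in zip(row, row[1:]):
            counts[a][b] += 1
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]


def glcm_features(P):
    """Contrast, correlation, energy, and homogeneity, Eqs. (4.12) to (4.15)."""
    n = len(P)
    cells = [(i, j, P[i][j]) for i in range(n) for j in range(n)]
    mu_i = sum(i * p for i, j, p in cells)
    mu_j = sum(j * p for i, j, p in cells)
    sd_i = sqrt(sum((i - mu_i) ** 2 * p for i, j, p in cells))
    sd_j = sqrt(sum((j - mu_j) ** 2 * p for i, j, p in cells))
    return {
        "contrast": sum((i - j) ** 2 * p for i, j, p in cells),
        # undefined for a constant image, where sd_i or sd_j is zero
        "correlation": sum((i - mu_i) * (j - mu_j) * p for i, j, p in cells) / (sd_i * sd_j),
        "energy": sum(p * p for i, j, p in cells),
        "homogeneity": sum(p / (1 + abs(i - j)) for i, j, p in cells),
    }


image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
features = glcm_features(glcm(image, levels=4))
print(features)
```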


Tamura texture. A texture pattern is described by using coarseness and brightness; 
the values are put into a vector or used as elements of a vector. 

Gabor texture. A texture pattern is described by using those coefficients of Fourier 
transforms from different layers. 

The 2D Fourier transform maps a scalar image $I$ from the spatial domain into a
complex-valued Fourier transform $F$ in the frequency domain,


$$F(u, v) = \frac{1}{W \cdot H} \sum_{x=0}^{W-1} \sum_{y=0}^{H-1} I(x, y) \exp\left[-i 2\pi \left(\frac{x u}{W} + \frac{y v}{H}\right)\right] \tag{4.16}$$


where $u = 0, \ldots, W - 1$ and $v = 0, \ldots, H - 1$, $i = \sqrt{-1}$ is the imaginary unit of
complex numbers, $W$ is the image width, and $H$ is the image height, respectively.

The inverse 2D DFT maps a Fourier transform $F$ in the frequency domain back into the
spatial domain


$$I(x, y) = \sum_{u=0}^{W-1} \sum_{v=0}^{H-1} F(u, v) \exp\left[i 2\pi \left(\frac{x u}{W} + \frac{y v}{H}\right)\right] \tag{4.17}$$


There are three properties of the DFT [54]:


• Symmetry property: $F(W - u, H - v) = F(-u, -v) = F(u, v)^*$

• Direct current/mean: $F(0, 0) = \frac{1}{W \cdot H} \sum_{x=0}^{W-1} \sum_{y=0}^{H-1} I(x, y)$

• Parseval's theorem: $\frac{1}{|\Omega|} \sum_{\Omega} |I(x, y)|^2 = \sum_{\Omega} |F(u, v)|^2$


where ‘*’ denotes the conjugate of a complex number and $\Omega$ is a region of the image block.
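
The three properties can be verified numerically on a tiny image with a direct (slow) implementation of the forward transform. The sketch assumes the 1/(W·H) factor sits on the forward transform, as the direct current property suggests; the function name `dft2` and the 3 × 4 test image are ours.

```python
import cmath


def dft2(img):
    """Direct 2D DFT with the 1/(W*H) factor on the forward transform; F[v][u]."""
    H, W = len(img), len(img[0])
    F = [[0j] * W for _ in range(H)]
    for v in range(H):
        for u in range(W):
            total = 0j
            for y in range(H):
                for x in range(W):
                    angle = -2 * cmath.pi * (x * u / W + y * v / H)
                    total += img[y][x] * cmath.exp(1j * angle)
            F[v][u] = total / (W * H)
    return F


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 13]]
H, W = len(image), len(image[0])
F = dft2(image)

mean = sum(map(sum, image)) / (W * H)
power_spatial = sum(p * p for row in image for p in row) / (W * H)
power_freq = sum(abs(c) ** 2 for row in F for c in row)
print(abs(F[0][0] - mean) < 1e-9, abs(power_spatial - power_freq) < 1e-9)
```

The quadruple loop costs O(W²H²) operations and is meant only for checking the formulas; practical code would use a fast Fourier transform.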

Examples of Fourier transform for images using MATLAB are shown in Fig. 4.4, 
where the left column lists all images and the right column shows the transformed 
image on frequency domain, correspondingly. 

The Fourier transform maps an image from the real number space to a complex number
one (with real and imaginary parts). We put the coefficients from different layers
together to form a feature vector. The vector is one of the computable features of
an object. The Fourier transform of a Gabor filter's impulse response is the convolution
between the Fourier transform of the harmonic function and the Fourier
transform of the Gaussian function; the filtered image is $I' = I * C(\cdot)$ [50],


$$C(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \exp\left[i \left(2\pi \frac{x'}{\lambda} + \psi\right)\right] \tag{4.18}$$


where $x' = x\cos\theta + y\sin\theta$ and $y' = -x\sin\theta + y\cos\theta$, and $I$ is the image used for the
convolution operation.

In Eq. (4.18), $\lambda$ represents the wavelength of the sinusoidal factor, $\theta$ represents the
orientation of the normal to the parallel stripes of a Gabor function, $\psi$ is the phase offset,
$\sigma$ is the standard deviation of the Gaussian envelope, and $\gamma$ is the spatial aspect ratio,
which specifies the ellipticity of the Gabor function.

• Edge/Shape

In computer vision and digital image processing, we usually need to extract the
edges or shapes of objects from an image [61]. The Canny (5 × 5), Sobel (3 × 3),
and Roberts (2 × 2) operators usually help us extract the edges of an object because
colors usually change dramatically in the regions near edges [39,54]. The
operators are based on the gradient or color changes of the designated regions. The
magnitude is usually calculated as $G = \sqrt{G_x^2 + G_y^2}$; the phase is noted as
$\theta = \arctan(G_y, G_x)$.


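(1) Canny operator


$$G_x = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix} \tag{4.19}$$

$$G_y = \frac{1}{2} \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix} \tag{4.20}$$


An example of edge extraction using the Canny operator from OpenCV is shown in
Figs. 4.5 and 4.6.

Fig. 4.4 Examples of Fourier Transform using MATLAB

(2) Sobel operator


$$G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \tag{4.21}$$

$$G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \tag{4.22}$$

(3) Roberts operator


$$G_x = \begin{bmatrix} +1 & 0 \\ 0 & -1 \end{bmatrix}, \quad G_y = \begin{bmatrix} 0 & +1 \\ -1 & 0 \end{bmatrix} \tag{4.23}$$


• Gradient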


The gradient reflects the gradual changes of pixel colors from one region to another; it
points in the direction of steepest ascent. Gradients usually include three types:
horizontal, vertical, and both.


$$\nabla I = \left(\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}\right) = (\nabla_x, \nabla_y) \tag{4.24}$$

Fig. 4.6 Image after using Canny edge extraction from OpenCV


Discretely,

$$(\Delta I_x, \Delta I_y) = (I(x + 1, y) - I(x, y), \; I(x, y + 1) - I(x, y)) \tag{4.25}$$
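
The following sketch evaluates the forward differences of Eq. (4.25) and the 3 × 3 Sobel kernels of Eqs. (4.21) and (4.22) at a single interior pixel, then derives the magnitude and phase. The helper names and the small test image with a vertical edge are assumptions for illustration.

```python
from math import atan2, hypot

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def forward_diff(img, y, x):
    """Eq. (4.25): differences toward the right-hand and lower neighbors."""
    return img[y][x + 1] - img[y][x], img[y + 1][x] - img[y][x]


def filter_at(img, kernel, y, x):
    """Weighted sum of the 3 x 3 neighborhood centered at (y, x)."""
    return sum(kernel[j][i] * img[y - 1 + j][x - 1 + i]
               for j in range(3) for i in range(3))


def sobel(img, y, x):
    """Gradient magnitude and phase from the two Sobel responses."""
    gx = filter_at(img, SOBEL_X, y, x)
    gy = filter_at(img, SOBEL_Y, y, x)
    return hypot(gx, gy), atan2(gy, gx)


edge = [[0, 0, 10],
        [0, 0, 10],
        [0, 0, 10]]
print(forward_diff(edge, 1, 1))  # (10, 0)
print(sobel(edge, 1, 1))         # (40.0, 0.0): a purely horizontal gradient
```

The two-argument arctangent keeps the correct quadrant of the phase and remains defined when the horizontal response is zero.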


The histogram of oriented gradients (HOG) is a feature descriptor in computer vision
and digital image processing for the purpose of object detection [54]. The technique
counts occurrences of gradient orientations in localized portions of an image. The HOG
feature shows the distribution of the directions in which the intensity changes
dramatically. A pixel is usually regarded as having four-connected or eight-connected
directions, as shown in Fig. 4.7.

HOG derives a descriptor for the bounding box of an object candidate. It
applies intensity normalization and a smoothing filter to the given image window $I$
and then derives the gradient magnitude and angle for each pixel, which generates a
magnitude map $I_m$ and an angle map $I_a$. To obtain a voting vector, the maps
$I_m$ and $I_a$ are used to accumulate magnitude values into direction bins. The normalized
voting values of all blocks are concatenated consecutively to produce the final HOG
descriptor. Figure 4.8 is an example of HOG features plotted over the original image.

Fig. 4.7 Two typical connections: (a) 4-connected, (b) 8-connected

Fig. 4.8 Plotting HOG features over the MATLAB image: cameraman


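• Moments

If we have a polygon to describe the shape of an object, we find the centroid
coordinates by using the average shown in Eq. (4.27). The moments include raw moments,
central moments, scale-invariant moments, rotation-invariant moments, etc.; the
central moments are calculated by using Eq. (4.28).

(1) Raw moments


$$M_{pq} = \sum_{x} \sum_{y} I(x, y) \cdot x^p \cdot y^q \tag{4.26}$$

(2) Central moments

$$x_c = \frac{M_{10}}{M_{00}}, \quad y_c = \frac{M_{01}}{M_{00}} \tag{4.27}$$

$$\mu_{pq} = \sum_{x} \sum_{y} I(x, y) (x - x_c)^p (y - y_c)^q \tag{4.28}$$


(3) Scale-invariant moments


$$\eta_{ij} = \frac{\mu_{ij}}{\mu_{00}^{\left(1 + \frac{i + j}{2}\right)}} \tag{4.29}$$


(4) Rotation-invariant moments
The most frequently used rotation-invariant moments are the Hu invariant moments,

$$\begin{aligned}
I_1 &= \eta_{20} + \eta_{02} \\
I_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
I_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
I_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
I_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2] \\
    &\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2] \\
I_6 &= (\eta_{20} - \eta_{02})[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2] \\
    &\quad + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
I_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2] \\
    &\quad - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2] \\
I_8 &= \eta_{11}[(\eta_{30} + \eta_{12})^2 - (\eta_{03} + \eta_{21})^2] - (\eta_{20} - \eta_{02})(\eta_{30} + \eta_{12})(\eta_{03} + \eta_{21})
\end{aligned}$$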
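
A short sketch of Eqs. (4.26) to (4.29) and of the first Hu moment shows the translation invariance in practice. It assumes a small image stored as a list of rows with x as the column index and y as the row index; the function names are ours.

```python
def raw_moment(img, p, q):
    """Eq. (4.26): sum of I(x, y) * x^p * y^q."""
    return sum(v * x ** p * y ** q
               for y, row in enumerate(img) for x, v in enumerate(row))


def centroid(img):
    """Eq. (4.27): centroid from the zeroth- and first-order raw moments."""
    m00 = raw_moment(img, 0, 0)
    return raw_moment(img, 1, 0) / m00, raw_moment(img, 0, 1) / m00


def central_moment(img, p, q):
    """Eq. (4.28): moments taken about the centroid."""
    xc, yc = centroid(img)
    return sum(v * (x - xc) ** p * (y - yc) ** q
               for y, row in enumerate(img) for x, v in enumerate(row))


def eta(img, i, j):
    """Eq. (4.29): scale-normalized central moment."""
    return central_moment(img, i, j) / central_moment(img, 0, 0) ** (1 + (i + j) / 2)


def hu1(img):
    """First Hu invariant: eta_20 + eta_02."""
    return eta(img, 2, 0) + eta(img, 0, 2)


a = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
b = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]  # same blob, shifted
print(centroid(a), centroid(b))
print(hu1(a), hu1(b))
```

Shifting the blob moves the centroid but leaves the central moments, and therefore the invariants built from them, unchanged.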


• Transforms
Sine and cosine functions construct an orthogonal function system; namely, for positive
integers m and n,


$$\langle \sin(mx), \cos(nx) \rangle = \int_{-\pi}^{\pi} \sin(mx) \cos(nx) \, dx = 0 \tag{4.30}$$

$$\langle \sin(mx), \sin(nx) \rangle = \int_{-\pi}^{\pi} \sin(mx) \sin(nx) \, dx = \begin{cases} \pi & m = n \\ 0 & m \neq n \end{cases} \tag{4.31}$$

$$\langle \cos(mx), \cos(nx) \rangle = \int_{-\pi}^{\pi} \cos(mx) \cos(nx) \, dx = \begin{cases} \pi & m = n \\ 0 & m \neq n \end{cases} \tag{4.32}$$


Therefore, consider the Fourier expansion of a function $f(x)$ that belongs to the $L^p$ space
with $p = 2$, i.e., $L^2$, namely $\int_{-\pi}^{\pi} |f(x)|^2 dx < \infty$,


$$f(x) = a_0 + \sum_{n=1}^{\infty} [a_n \sin(nx) + b_n \cos(nx)] \tag{4.33}$$


We have, for $n > 0$,

$$\begin{aligned}
a_0 &= \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x) \, dx \\
a_n &= \langle f(x), \sin(nx) \rangle = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin(nx) \, dx \\
b_n &= \langle f(x), \cos(nx) \rangle = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos(nx) \, dx
\end{aligned} \tag{4.34}$$


where $\langle \cdot \rangle$ is the inner product or dot product, normalized by $1/\pi$ in Eq. (4.34). Therefore,


$$f(x) = a_0 + \sum_{n=1}^{\infty} [\langle f(x), \sin(nx) \rangle \sin(nx) + \langle f(x), \cos(nx) \rangle \cos(nx)] \tag{4.35}$$


The Fourier expansion of the function $f(x)$ converges and equals $f(x)$ wherever $f(x)$ is
continuous [50].
By Euler's formula,

$$e^{i x} = \cos(x) + i \cdot \sin(x) \tag{4.36}$$

Hence,


$$\cos(x) = \mathrm{Re}\{e^{ix}\} = \frac{e^{ix} + e^{-ix}}{2}; \quad \sin(x) = \mathrm{Im}\{e^{ix}\} = \frac{e^{ix} - e^{-ix}}{2i} \tag{4.37}$$


furthermore,

$$\cos(n \cdot x) = 2 \cdot \cos[(n - 1) \cdot x] \cdot \cos(x) - \cos[(n - 2) \cdot x] \tag{4.38}$$


When n = 2,

$$\cos(2 \cdot x) = 2 \cdot \cos^2(x) - 1 \tag{4.39}$$


A special example is the wavelet transform, which uses wavelet basis functions to
construct orthogonal function systems so as to decompose digital signals or images
at multiple layers with multiple resolutions [99], and which can outperform DCT [76]
and the Fourier transform [72].


• Local Features

A corner in an image is given at a pixel where two edges of different directions
intersect. Corners usually lie in high-contrast regions of the image. The relative
positions between corners in the original scene should not change after a variety of
transformations.

Corners are invariant to scaling, orientation, and distortions. The best match for
each pair of corners is found by identifying the nearest neighbor of each corner. The
nearest neighbors are defined as the corners with the minimum distance from the given
descriptor. If a pixel is inside an object, its surroundings (solid square) correspond to
those of its neighbor (dotted square). This is true for adjacent pixels in all directions.
If a pixel is on the edge of an object, its surroundings differ from those of its neighbors in
one direction but correspond to the surroundings of its neighbors in the other
(perpendicular) direction. A corner pixel has surroundings that differ from those of all
of its neighbors in all directions.

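Therefore, a corner in an image $I$ is found at a pixel where two edges of different
directions intersect. Corner detection usually uses the eigenvalues of the Hessian matrix
at a pixel $p$. If the magnitudes of both eigenvalues of the matrix in Eq. (4.40) are
“large,” we say the pixel is at a corner,


$$H(p) = \begin{bmatrix} I_{xx}(p) & I_{xy}(p) \\ I_{xy}(p) & I_{yy}(p) \end{bmatrix} \tag{4.40}$$

where the partial derivatives are defined as,

$$\frac{\partial I}{\partial x}(x, y) = \lim_{\Delta x \to 0} \frac{I(x + \Delta x, y) - I(x, y)}{\Delta x} \tag{4.41}$$

$$\frac{\partial I}{\partial y}(x, y) = \lim_{\Delta y \to 0} \frac{I(x, y + \Delta y) - I(x, y)}{\Delta y} \tag{4.42}$$

and,

$$\frac{\partial I_x(x, y)}{\partial x} = I_{xx}(x, y); \quad \frac{\partial I_y(x, y)}{\partial x} = I_{yx}(x, y);$$

$$\frac{\partial I_x(x, y)}{\partial y} = I_{xy}(x, y); \quad \frac{\partial I_y(x, y)}{\partial y} = I_{yy}(x, y);$$


Corner detection takes advantage of the Harris detector with the cornerness
measure [54],


$$G(p, \sigma) = \begin{bmatrix} L_x^2(p, \sigma) & L_x(p, \sigma) L_y(p, \sigma) \\ L_x(p, \sigma) L_y(p, \sigma) & L_y^2(p, \sigma) \end{bmatrix} \tag{4.43}$$


furthermore,

$$L(p, \sigma) = [I * G_\sigma](p), \quad p = (x, y) \tag{4.44}$$


where $I$ is an image, and $G_\sigma$ $(\sigma > 0)$ is a local convolutional function.

$$\mathcal{H}(p, \sigma, \alpha) = \det(G) - \alpha \cdot \mathrm{Tr}(G) = \lambda_1 \lambda_2 - \alpha \cdot (\lambda_1 + \lambda_2)$$


where $\alpha$ is a constant. Corner points have large, positive eigenvalues and would
thus have a large Harris measure. Corners are applied to object tracking as shown in
Fig. 4.9.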

The scale-invariant feature transform (SIFT) helps us find the corresponding
keypoints in two similar images no matter how the images are zoomed. Keypoints
have steeper gradients than the points around them. These keypoints are very robust
in describing objects in both object detection and tracking [90].
SIFT for image matching usually goes through the following steps: (1) corner
detection is based on changes of the adjacent regions; the changes at corner points are
different from the changes at their neighbors, namely inner points; (2) local
neighborhood description is based on the circular region around a corner; (3) corner
matching uses the distance between these circular regions.

An example is shown in Fig. 4.10; 1993 and 1856 keypoints were detected from
the grayscale versions of the left and right images, respectively; 134 matches were
found when we used the SIFT algorithm provided at:
http://www.cs.ubc.ca/~lowe/keypoints/.

Fig. 4.9 Corners for object tracking

Fig. 4.10 Example of the SIFT: 1993 keypoints found (left), 1856 keypoints found (right),
134 matches found

Speeded-up robust features (SURF) can run about seven times faster than SIFT.
Random sample consensus (RANSAC) separates inliers from outliers through fitting
or regression; the hypothetical inliers form the consensus set. Corners of digital images
are usually regarded as hypothetical inliers; the inliers selected from the keypoints
are employed for image registration, mosaicking, etc. [54].


• Distance between Visual Feature Vectors

After selecting all features, we integrate them as elements into one feature vector
$V_f = [C, H, T, M, \ldots, W]$, where $C = (\mu, \sigma)$ is the color information, $\mu$ is the
average of the colors, and $\sigma$ is the variance; $H = (h_1, h_2, \ldots, h_B)$ refers to the
histogram, which depends on the number of bins (B); $T = (c_1, c_2, \ldots, c_n)$ represents
texture coefficients; $M = (m_1, m_2, \ldots, m_s)$ describes coefficients of moments;
$W = (w_1, w_2, \ldots, w_s)$ depicts frequency coefficients after visual signal decomposition,
etc. The visual feature vector $V_f$ is used for further computing such as search, retrieval,
mining, and reasoning [6].

If we have two computable feature vectors $V_1$ and $V_2$ describing two objects
respectively, the two objects are compared through the distance between the two vectors,
$d(V_1, V_2)$. The inner product of the two vectors is usually used to calculate
the cosine value, which measures how similar the two images are to each other:


$$d(V_1, V_2) = \cos(\theta) = \frac{V_1 \cdot V_2}{|V_1| \cdot |V_2|} \tag{4.45}$$

For $\cos(\theta) \in [0, 1]$, if $\cos(\theta) = 1$, the two objects are regarded as the
same; if $\cos(\theta) = 0$, the directions of the two visual vectors are perpendicular. Moreover,
we calculate the distance by using Eq. (4.46)


$$d = \|V_1 - V_2\|_p = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{\frac{1}{p}} \tag{4.46}$$


This is called the p-norm distance, where $V_1 = (x_1, x_2, \ldots, x_n)$ and
$V_2 = (y_1, y_2, \ldots, y_n)$.

In Eq. (4.46), if $p = 1$, the metric is called the $L_1$ distance; if $p = 2$, the distance
is named the $L_2$ or Euclidean distance; if $p \to \infty$, the distance is called the $\infty$-norm
distance, $L_\infty = \max(|x_i - y_i|)$; if $p \to -\infty$, the expression tends to
$\min(|x_i - y_i|)$, which is no longer a norm.

In mathematics, the Euclidean distance is the “normal” distance between two vectors
that one would measure with a ruler. With this distance, Euclidean space (or
even any inner product space) becomes a metric space. Namely, for any $x, y, z \in M$:


$$d(x, y) \ge 0$$
$$d(x, y) = 0 \Leftrightarrow x = y$$
$$d(x, y) = d(y, x)$$
$$d(x, z) \le d(x, y) + d(y, z)$$


where d is a distance defined on a set M.
Other feature vector distances include the Mahalanobis distance from statistics, shown
in Eq. (4.47),

$$ d(V_1, V_2) = \sqrt{(V_1 - V_2)^{T} S^{-1} (V_1 - V_2)} \tag{4.47} $$

where $S$ is the covariance matrix for the vectors $V_1$ and $V_2$.

In mathematics, the Hamming distance between two strings of equal length is the
number of positions at which the corresponding symbols differ. In other words,
it measures the minimum number of substitutions required to change
one string into the other, or the minimum number of errors that could have transformed
one string into the other. The Hamming distance between two numbers is
the number of differing bits in their binary representations, as shown in
Eq. (4.48),

$$ d(s_1, s_2) = \sum_{i=1}^{n} \delta(c_{1i}, c_{2i}) \tag{4.48} $$

where $s_1 = (c_{11} c_{12} \cdots c_{1n})_2$ and $s_2 = (c_{21} c_{22} \cdots c_{2n})_2$, and

$$ \delta(c_{1i}, c_{2i}) = \begin{cases} 0, & c_{1i} = c_{2i} \\ 1, & c_{1i} \neq c_{2i} \end{cases} $$
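The Hamming distance of Eq. (4.48) can be sketched in a few lines of Python. The helper names `hamming_distance` and `hamming_distance_int` and the two bit strings are hypothetical examples of ours.

```python
def hamming_distance(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def hamming_distance_int(a, b):
    """Hamming distance of two non-negative integers: count the 1-bits of a XOR b."""
    return bin(a ^ b).count("1")

print(hamming_distance("1011101", "1001001"))
print(hamming_distance_int(0b1011101, 0b1001001))
```

Both calls compare the same pair of binary numbers, once as strings and once as integers, so they return the same count.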
4.2 Object Segmentation


Object segmentation is the first step of object analytics; it separates the different
regions contained in an image. The simplest approach segments a
color image by using statistics in a color space, as shown in Fig. 4.11. The
statistics tell us the color components of the image and how the colors are distributed;
the color distribution of the sample image is shown in Fig. 4.12. Based on this color
distribution, we segment the image into different regions;
each region is represented by the color of its class.


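Mathematically, we partition an image $\Omega$ into a finite number of regions $S_i$,
$i = 1, \ldots, n$; the regions satisfy:

$$ S_i \neq \emptyset, \quad \forall i \in \{1, 2, \ldots, n\} $$
$$ \bigcup_{i=1}^{n} S_i = \Omega $$
$$ S_i \cap S_j = \emptyset, \quad \forall i, j \in \{1, 2, \ldots, n\} \text{ with } i \neq j $$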


Approach I. Color-based segmentation using the color space. We need to:

• Acquire an image.
• Calculate sample colors in RGB color space.
• Classify each pixel using the nearest neighbor rule.
• Display results of the nearest neighbor classification.
• Mark each pixel using the color class number it belongs to.
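The classification and marking steps of Approach I can be sketched as below. This is only an illustration under our own assumptions: the image is a small nested list of RGB tuples, the sample colors are hand-picked, and the squared Euclidean distance in RGB space is used as the nearest neighbor rule. The names `nearest_color_class` and `segment_by_color` are hypothetical.

```python
def nearest_color_class(pixel, sample_colors):
    """Index of the sample color closest to the pixel in RGB space."""
    def sq_dist(c1, c2):
        return sum((a - b) ** 2 for a, b in zip(c1, c2))
    return min(range(len(sample_colors)),
               key=lambda k: sq_dist(pixel, sample_colors[k]))

def segment_by_color(image, sample_colors):
    """Mark every pixel with the number of the color class it belongs to."""
    return [[nearest_color_class(px, sample_colors) for px in row]
            for row in image]

# Hypothetical sample colors (class 0 = red, 1 = green, 2 = blue).
sample_colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
# A tiny 2 x 2 test image given as rows of RGB tuples.
image = [
    [(250, 10, 5), (12, 240, 30)],
    [(20, 15, 200), (200, 40, 40)],
]
labels = segment_by_color(image, sample_colors)
print(labels)
```

In practice the sample colors would be computed from marked regions of the acquired image rather than fixed by hand.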


Fig. 4.11 A color image with flowers


Fig. 4.12 Color statistical result for image segmentation


Approach II. Marker-controlled watershed segmentation 

The second approach to image segmentation is watershed segmentation. The
result is shown in Fig. 4.13; the steps of this algorithm are listed below. We need
to:

• Import a color image as the input and convert it to grayscale.
• Use the gradient magnitude as the segmentation function.
• Mark the foreground objects.
• Compute background markers.
• Visualize the result.

We provide a result of watershed image segmentation using the well-known
OpenCV platform in Fig. 4.13. We manually mark regions on a color image, and the
watershed algorithm finds the similar regions by using the gradient magnitude. Once a
peak of the gradient is reached, a boundary is found, and the image is segmented (Fig. 4.14).


Approach III. Texture segmentation using filtering

The steps of this kind of object segmentation are to:

• Import an image as the input.
• Create a texture image.
• Create a rough mask for the bottom texture.
• Use the rough mask to segment the top texture.
• Display the segmentation results.


Fig. 4.13 Original image (a) and the segmented image (b) using the watershed segmentation
algorithm of OpenCV




Fig. 4.14 Result of texture-based image segmentation using MATLAB 


Approach IV. Markov random field (MRF) has been employed for image segmentation.
Figure 4.15 is an example of image segmentation using an algorithm obtained from the
Web: (a) is a color image for segmentation, and (b) is the segmented image after
five classes were assigned to the image.

A Markov random field (MRF) is defined by

$$ P(f_i \mid f_{S - \{i\}}) = P(f_i \mid f_{N_i}) \tag{4.49} $$

where $P(f) > 0$ for all $f \in F$, and $f_{N_i} = \{ f_{i'} \mid i' \in N_i \}$; Eq. (4.49) is the
Markovianity. $N = \{ N_i \mid \forall i \in S \}$ is a neighborhood system, in which $N_i$ is the set
of sites neighboring $i$, with $i \notin N_i$. Moreover, $i' \in N_i \Leftrightarrow i \in N_{i'}$ [60].


Fig. 4.15 Image segmentation using Markov random field

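In the 2D case, $\mathcal{L}(i, j)$ is the lattice, and each site $(i, j)$ has four neighbors,

$$ N_{i,j} = \{ (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1) \} \tag{4.50} $$

For example, edges correspond to abrupt changes or discontinuities between
neighboring areas. If $g(\cdot)$ is the truncated quadratic potential function, the pixel
values near the edges are obtained by minimizing

$$ f^{*} = \arg\min_{f} E(f) \tag{4.51} $$

and,

$$ E(f) = \sum_{i \in S} (f_i - d_i)^2 + \sum_{i \in S} \sum_{i' \in N_i} \lambda_i \, g(f_i - f_{i'}) \tag{4.52} $$

where $d_i$ is the observed value at site $i$ and $\lambda_i$ weights the smoothness term.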
4.3 Object Annotation, Classification, and Recognition


OpenCV is a computer vision platform which is able to detect and recognize moving
objects [22,53]. A face recognition result based on OpenCV is demonstrated in
Fig. 4.16 [94-96].

In face recognition, we first need to detect a human face using OpenCV and then
recognize the face using a classifier trained on a training set. To increase
accuracy, irrelevant moving objects in the background of the scene
have to be removed. Precision and recall are calculated to compare the results before
and after moving object removal (MOR) [13].

Automatic license plate number recognition is usually defined as the use of digital
image processing and computer vision to automatically recognize each character in
pictures containing plate number information [2,4,14,18].

In general, the basic modules of an automatic license plate number recognition system
consist of plate detection [102], also known as plate number extraction [64,68,74],
character segmentation, and character recognition [19,55]. To implement the
functions of each module, numerous techniques have been developed and presented.
Balancing efficiency against cost, computer vision and digital image processing
are broadly adopted in practice [3,4,15,18].

Fig. 4.16 An example of human face recognition


The target of character feature extraction [65] is to find a transformation
that yields the minimum set of features representing the original data. As
a result, the processing will be accelerated and the storage space will be minimized.
The popular methods of feature extraction include zoning, projection histogram,
distance profile, background directional distribution, and Fourier descriptors. Using a
combination of features will increase the efficiency of character recognition. Hence,
the features used in this example are extracted by the zoning method and Fourier
descriptors. The idea of the zoning method is to count the pixels in each
area of a grid laid over the character image. After the number of pixels
in each predefined area has been obtained, the counts are normalized.
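The zoning method described above can be sketched as follows. The source does not specify the normalization, so dividing each zone count by the total number of foreground pixels is our assumption; the function name `zoning_features` and the 4 x 4 glyph are hypothetical, and the image size is assumed to be divisible by the grid size.

```python
def zoning_features(binary_image, rows, cols):
    """Count foreground pixels (value 1) per grid zone and normalize the counts."""
    height = len(binary_image)
    width = len(binary_image[0])
    zone_h = height // rows   # assumes height is divisible by rows
    zone_w = width // cols    # assumes width is divisible by cols
    counts = []
    for r in range(rows):
        for c in range(cols):
            count = sum(
                binary_image[y][x]
                for y in range(r * zone_h, (r + 1) * zone_h)
                for x in range(c * zone_w, (c + 1) * zone_w)
            )
            counts.append(count)
    total = sum(counts)
    # Assumed normalization: fraction of all foreground pixels in each zone.
    return [n / total for n in counts] if total else [0.0] * len(counts)

# Hypothetical 4 x 4 binary character image split into a 2 x 2 grid.
glyph = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
features = zoning_features(glyph, 2, 2)
print(features)
```

The resulting feature vector has one element per zone and can be concatenated with other descriptors before classification.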

A typical automatic license plate recognition (LPR) [18,55] system adopts digital
image processing techniques to locate and recognize the characters on a car's number
plate and outputs the results as a textual string or in other formats that can be easily
understood in semantics [3,68,73]. LPR systems have been applied to various
applications which require automatic control of the presence and identification of a
motor car by using its plate number, such as stolen vehicle detection, automatic electronic
toll collection (ETC), automated parking attendance, traffic ticketing management,
security control, and others [5,11,51]. A plate number recognition system usually
consists of five important modules: image acquisition, image preprocessing, plate
number detection [64], image segmentation, and character recognition [3,4,98].

The plate number detection module plays a pivotal role in plate number recognition [55,
68]. If the position of a plate number in an image cannot be located accurately at
the very beginning, the subsequent steps cannot proceed correctly; moreover, the plate
number may appear anywhere within the image [81,104]. To detect the location
of a plate number [104], the particular features for plate number recognition have to
be considered and determined [64]. The candidate plate number region is extracted
and then verified to confirm that the plate number is correctly located in the image [61,65].

In plate number scanning, an RGB image is converted to HSV color space [58,
88,97,100]. A binary image is obtained by detecting red pixels in HSV color
space in accordance with prior knowledge of the hue and saturation of the red color.
After that, morphological operations such as closing and erosion are applied to filter
out noise [63,92,103]. Small redundant regions are removed by using
the image opening operation.

Normally, the two red regions that indicate the taillights of a car are displayed in the
filtered binary image. The two red taillights should appear at the same vertical level.
Hence, a condition is set to examine whether the detected regions are the required ones;
the decision is based on the distance between the two centroids of the detected
areas. If the difference between their vertical positions is less than the predefined
threshold, the detected regions are two taillights in the same line [94-96].

Template matching is defined as a technique in digital image processing that matches
a given image against multiple templates [35]. Plate number recognition
compares each given image with the templates so as to find out which template
best matches the given image [98,101]. A number of matching functions
are employed to measure the similarity between the given image and the templates.
The general matching functions include sum of square differences (SSD), normalized
cross correlation (NCC), and mean absolute difference (MAD).

$$ d_{SSD}(I, I_T) = \sum_{i=1}^{W} \sum_{j=1}^{H} \left( I(i, j) - I_T(i, j) \right)^2 \tag{4.53} $$

where $I$ is the given image, $I_T$ is the template, and $W$ and $H$ are the width and height,
respectively.

$$ d_{MAD}(I, I_T) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \left| I(i, j) - I_T(i, j) \right| \tag{4.54} $$

$$ d_{NCC}(I, I_T) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \frac{[I(i, j) - \bar{I}] \cdot [I_T(i, j) - \bar{I}_T]}{\sigma_I \, \sigma_{I_T}} \tag{4.55} $$

where $\bar{I}$ and $\bar{I}_T$ are the means, and $\sigma_I$ and $\sigma_{I_T}$ are the standard deviations
of image $I$ and template $I_T$, respectively.
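The three matching functions of Eqs. (4.53)-(4.55) can be sketched in plain Python as below, assuming the image patch and the template are equally sized nested lists of gray values. The names `ssd`, `mad`, and `ncc` and the sample values are our own; a constant image (zero standard deviation) is not handled by `ncc`.

```python
import math

def ssd(image, template):
    """Sum of square differences (Eq. 4.53)."""
    return sum((a - b) ** 2
               for ra, rb in zip(image, template) for a, b in zip(ra, rb))

def mad(image, template):
    """Mean absolute difference (Eq. 4.54)."""
    n = len(image) * len(image[0])
    return sum(abs(a - b)
               for ra, rb in zip(image, template) for a, b in zip(ra, rb)) / n

def ncc(image, template):
    """Normalized cross correlation (Eq. 4.55); 1.0 means a perfect match."""
    a = [v for row in image for v in row]
    b = [v for row in template for v in row]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    std_a = math.sqrt(sum((v - mean_a) ** 2 for v in a) / n)
    std_b = math.sqrt(sum((v - mean_b) ** 2 for v in b) / n)
    return sum((x - mean_a) * (y - mean_b)
               for x, y in zip(a, b)) / (n * std_a * std_b)

# Hypothetical 2 x 2 image patch and template.
patch = [[10, 20], [30, 40]]
template = [[12, 18], [33, 41]]
print(ssd(patch, template), mad(patch, template), ncc(patch, template))
```

SSD and MAD are dissimilarities (smaller is better), while NCC is a similarity (larger is better) that is insensitive to uniform changes of brightness and contrast.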

In real instances of license plate recognition [59], when a road is uneven or has
bends, a running vehicle shakes. Consequently, the plate is also unstable
and appears tilted or rotated. In this case of plate correction, the Hough transform is used
to detect the straight lines of characters in the acquired plate images; the tilt angle
of a straight line with respect to the horizontal direction is calculated, which gives
the angle of rotation. The image is rotated back to the horizontal direction by the
detected angle. After the rotation, bilinear interpolation is applied to reconstruct the
corrected image.

The methods of license plate character recognition are based on template matching
and artificial neural networks. In this case of plate recognition, the genetic
algorithm (GA) is used as a search method: each weight of the network is a gene
position on the chromosome, and the length of the chromosome corresponds to the number
of weights in the network; the accuracy of this GA-trained neural network (GNN) is
reported to be higher than that of the backpropagation (BP) algorithm.

Artificial neural networks (ANNs) and deep learning (e.g., CNN and RNN) [28,
56,57] have been employed for character recognition [43,69]. The neural network was
developed with the multilayer feedforward backpropagation algorithm using one hidden
layer [16]. ANNs have been applied to classify the number plate from color images.
An example of automatic number plate recognition is shown in Fig. 4.17 [94-96]
(Fig. 4.18).


Fig. 4.18 An example of point tracking


4.4 Object Locating and Tracking 


Assume we have segmented all visual objects from an image; we need to locate 
and track the objects [30,61]. The tracking algorithms include point tracking, ker- 
nel tracking, silhouette (contour and shape) tracking, etc. [90]. In point tracking, 
deterministic and probabilistic approaches are employed. Probabilistic approaches 
are statistically based. 

Kernel tracking is template- and multiview-based. The template matching is based 
on calculations pixel by pixel. In general, approaches of this kind are very slow 
in computing. Multiview algorithms include two parts, namely view subspace and 
classifier [7]. 

Silhouette tracking includes contour evolution and shape matching. Contour evolution
refers to state-space methods and direct minimization. Direct minimization
encompasses the variational approach and the heuristic approach.


Fig. 4.19 An example of silhouette related to human gait


A silhouette is represented as a solid shape of a single color; an example of a silhouette
is shown in Fig. 4.19. The interior of a silhouette is featureless, and the whole is
typically presented on a light background. The term was originally used to describe cut
paper which was stuck to a backing in a contrasting color and often framed.

A blob is a very similar concept in computer vision. Informally, a blob is a region
of an image in which some properties are constant or approximately constant; all the
points in a blob can be considered similar to each other. Two kinds of methods
are used to detect blobs: differential methods and methods based on local extrema.

MPEG videos already store motion vectors, which are calculated during compression
by motion estimation, a technique closely related to optical flow, an approach to
describing motion in a video [75]. The motion vectors encode the estimated motion of
objects; therefore, we are able to use the forward and backward motion vectors to track
a moving object [67]. Because the motion vectors were already saved in the video footage
at compression time, reusing them greatly reduces computing time.

Suppose the optical flow is $\mathbf{u} = [u, v]^{T}$; a visible displacement starts at $p = (x, y)$ and ends
at $p = (x + u, y + v)$. Optical flow aims at 2D motion estimation. The 2D motion vectors
form a vector field; a motion vector field is dense if it contains motion vectors at all
pixels; otherwise, it is sparse.

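From the Newton-Leibniz formula and Taylor expansion, we know that if

$$ I(x, y, t) = I(x + \delta x, y + \delta y, t + \delta t) $$

then,

$$ \delta x \cdot \frac{\partial I}{\partial x}(x, y, t) + \delta y \cdot \frac{\partial I}{\partial y}(x, y, t) + \delta t \cdot \frac{\partial I}{\partial t}(x, y, t) = 0 $$

hence,

$$ \frac{\delta x}{\delta t} \cdot \frac{\partial I}{\partial x}(x, y, t) + \frac{\delta y}{\delta t} \cdot \frac{\partial I}{\partial y}(x, y, t) + \frac{\partial I}{\partial t}(x, y, t) = 0 $$

If we assume,

$$ \mathbf{u}(x, y, t) = (u(x, y, t), v(x, y, t)) = \left( \frac{\delta x}{\delta t}, \frac{\delta y}{\delta t} \right) $$
$$ \mathbf{g}(x, y, t) = (I_x(x, y, t), I_y(x, y, t)) = \nabla_{x,y} I $$

then,

$$ \mathbf{u}(x, y, t) \cdot \mathbf{g}(x, y, t) = -I_t(x, y, t) $$

Namely, the optical flow equation is,

$$ \mathbf{u} \cdot \mathbf{g} = -I_t $$

where $\mathbf{g}$ is the gradient vector. Following this, the typical algorithms for seeking the
optical flow $\mathbf{u} = [u, v]^{T}$ are the Horn-Schunck algorithm and the Lucas-Kanade algorithm;
interested readers could consult a computer vision book for the details. An
example of calculating optical flow is shown in Fig. 4.20.


Fig. 4.20 Optical flow detection using OpenCV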
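One optical flow equation has two unknowns, so it cannot be solved at a single pixel. A minimal sketch of the Lucas-Kanade idea is to assume one flow vector for a whole window and solve the stacked equations in the least-squares sense. The function name `lucas_kanade_window` and the synthetic gradients below are our own assumptions for illustration; they are not taken from OpenCV.

```python
def lucas_kanade_window(ix, iy, it):
    """Least-squares flow (u, v) from gradients Ix, Iy, It sampled in one window."""
    # Normal equations: [[a11, a12], [a12, a22]] [u, v]^T = [b1, b2]^T
    a11 = sum(gx * gx for gx in ix)
    a12 = sum(gx * gy for gx, gy in zip(ix, iy))
    a22 = sum(gy * gy for gy in iy)
    b1 = -sum(gx * gt for gx, gt in zip(ix, it))
    b2 = -sum(gy * gt for gy, gt in zip(iy, it))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("degenerate gradients in the window (aperture problem)")
    u = (a22 * b1 - a12 * b2) / det
    v = (a11 * b2 - a12 * b1) / det
    return u, v

# Synthetic window: spatial gradients are chosen by hand, and the temporal
# gradient is generated from a known flow via u * Ix + v * Iy = -It.
true_u, true_v = 1.0, -0.5
ix = [1.0, 0.0, 2.0, 1.0]
iy = [0.0, 1.0, 1.0, -1.0]
it = [-(gx * true_u + gy * true_v) for gx, gy in zip(ix, iy)]
u, v = lucas_kanade_window(ix, iy, it)
print(u, v)
```

With noise-free synthetic gradients the known flow is recovered; on real images the gradients come from finite differences between neighboring pixels and consecutive frames.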




A typical algorithm for object tracking is mean shift, which is based on template
matching. Usually, we begin from the starting position of a model in the current
frame; we search the neighborhood in the next frame and find the best candidate by
maximizing a similarity function (usually a 10% step length is taken). Once it is
found, we use the best position as the new start, search for the next best region in
the next frame, and so on; in this way, the object is tracked. The
advantage of this algorithm is its simplicity and ease of implementation, but
the disadvantage is its slowness and ineffectiveness.

Additional algorithms for object tracking include Bayesian filtering,
Kalman filtering [36], particle filtering [17], and data association.

The objective of Kalman filtering is “Noisy data in, hopefully less noisy data out.”
Consider a linear dynamic system $\dot{x} = A \cdot x$, where $A$ is a constant $n \times n$ matrix.

From $e^{x} = 1 + \sum_{i=1}^{\infty} \frac{x^{i}}{i!}$, we have the state transition matrix

$$ F_{\Delta t} = e^{\Delta t A} = I + \sum_{i=1}^{\infty} \frac{\Delta t^{i} A^{i}}{i!} $$

For a discrete system,

$$ x_t = F x_{t-1} + B u_t + w_t \tag{4.56} $$

$$ y_t = H x_t + v_t \tag{4.57} $$

where $B$ is the control matrix, $u_t$ is the system control vector, $w_t$ is the process noise
vector, $H$ is the observation matrix, $y_t$ is the noisy observation, and $v_t$ is the
observation noise vector.

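Kalman filtering updates our knowledge based on experienced prediction errors
and observations; we use the improved knowledge to reduce the prediction error. In
the predict phase,

$$ \begin{cases} \hat{x}_{t|t-1} = F \hat{x}_{t-1|t-1} + B u_t \\ P_{t|t-1} = F P_{t-1|t-1} F^{T} + Q_t \end{cases} \tag{4.58} $$

In the update phase,

$$ \begin{cases} \tilde{z}_t = y_t - H \hat{x}_{t|t-1} \\ S_t = H P_{t|t-1} H^{T} + R_t \\ \hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t \tilde{z}_t \end{cases} \tag{4.59} $$

where $Q_t$ and $R_t$ are the covariance matrices of the process noise and the observation
noise, respectively. With the optimal Kalman gain,

$$ \begin{cases} K_t = P_{t|t-1} H^{T} S_t^{-1} \\ P_{t|t} = (I - K_t H) P_{t|t-1} \end{cases} \tag{4.60} $$

The matrix

$$ K_t = P_{t|t-1} H^{T} S_t^{-1} \tag{4.61} $$

minimizes the mean square error $E[(x_t - \hat{x}_{t|t})^2]$, which is equivalent to minimizing the
trace of $P_{t|t}$. The matrix $K_t$ is known as the optimal Kalman gain.

If we have the continuous model matrix $A$ for the given linear dynamic process
$\dot{x} = A \cdot x$, the alternative model for the predict phase is more straightforward,

$$ \begin{cases} \hat{x}_{t|t-1} = A \hat{x}_{t-1|t-1} + B u_t \\ P_{t|t-1} = A P_{t-1|t-1} A^{T} + Q_t \end{cases} \tag{4.62} $$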
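The predict and update phases of Eqs. (4.58)-(4.61) can be sketched for the simplest case of a scalar state. The sketch below assumes $F = 1$, $H = 1$, no control input ($B u_t = 0$), and constant noise variances `q` and `r`; the function name `kalman_1d` and all numeric values are our own illustrative choices.

```python
def kalman_1d(observations, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter with F = 1, H = 1 and no control input."""
    x, p = x0, p0          # initial state estimate and its variance
    estimates = []
    for y in observations:
        # Predict phase (Eq. 4.58): the state is assumed to stay constant.
        x_pred = x
        p_pred = p + q
        # Update phase (Eqs. 4.59-4.61).
        innovation = y - x_pred        # measurement residual
        s = p_pred + r                 # innovation variance
        k = p_pred / s                 # optimal Kalman gain
        x = x_pred + k * innovation
        p = (1.0 - k) * p_pred
        estimates.append(x)
    return estimates

# Hypothetical constant signal observed 50 times.
observations = [5.0] * 50
estimates = kalman_1d(observations, q=1e-5, r=0.1)
print(estimates[0], estimates[-1])
```

Starting from the deliberately wrong initial guess `x0 = 0`, each update pulls the estimate toward the observations, and the gain shrinks as the estimated variance decreases.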


A particle represents a feature in a multidimensional space; the dimensions of
this space combine locations with descriptor values in a parameter vector. A particle
filter can track parameter vectors over time or within a space by evaluating their
consistency with a defined model. A condensation algorithm is used to analyze a
cluster of weighted particles in order to identify a winning particle.

The iterative condensation process decides which of the randomly generated
particles is taken as the result for the next image row. One iteration of condensation is
called resampling; its goal is to improve the quality of the particles. A particle with
a high weight is very likely to survive the resampling process. Resampling takes all
the current weighted particles as input and outputs a set of weighted particles.
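The resampling step can be sketched with weighted random sampling from the Python standard library. This is a simple multinomial resampling scheme, chosen by us for brevity; assigning equal weights to the survivors is likewise our assumption, and the name `resample` and the particle values are hypothetical.

```python
import random

def resample(particles, weights, rng):
    """Draw a new particle set; high-weight particles are likely to survive."""
    total = sum(weights)               # assumes at least one positive weight
    probabilities = [w / total for w in weights]
    survivors = rng.choices(particles, weights=probabilities, k=len(particles))
    uniform = 1.0 / len(particles)     # assumed: survivors get equal weights
    return survivors, [uniform] * len(particles)

# Hypothetical one-dimensional particles; the third one fits the model best.
particles = [0.0, 1.0, 2.0, 3.0]
weights = [0.01, 0.01, 0.97, 0.01]
survivors, new_weights = resample(particles, weights, random.Random(0))
print(survivors, new_weights)
```

Because the draw is random, a low-weight particle can occasionally survive; over many iterations the particle set nevertheless concentrates around the states that are consistent with the model.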

Image classifiers are grouped into two categories: learning-based or parametric
classifiers such as artificial neural networks (ANNs), which rely on a training
period, and nonparametric classifiers such as k-nearest neighbor (k-NN), which do
not require any training time, can handle a large number of classes, and can avoid
the over-fitting problem [43,45,46].

Comparisons between the k-nearest neighbor classifier and other classifiers have been
made primarily in terms of performance. Multilayer perceptron (artificial neural
network)-based (parametric) classifiers outperform nearest neighbor, Bayesian, and
minimum-mean-distance (nonparametric) classifiers, especially with noise. Learning-based
classifiers including ANNs also outperform nonparametric classifiers with
regard to error rate. An issue with ANNs is that high levels of accuracy come at the
expense of considerable training time.

The support vector machine (SVM) is now generally thought to be superior to the
single-layer ANN with regard to generalization accuracy in image classification; that
is, the SVM achieves higher levels of accuracy when classifying an unknown
test set.


4.4.1 Support Vector Machine (SVM) 


In computer vision, unsupervised learning aims to find hidden structure in unlabeled
data. This distinguishes unsupervised learning from supervised learning and
reinforcement learning. Supervised learning is the machine learning task of inferring a
function from labeled training data. The training data consist of a set of training
examples. In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output (also called the supervisory signal). A
supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples.


Fig. 4.21 Margin distance between two classes of SVM classification


The typical supervised learning algorithms include support vector machine (SVM)
and decision trees. For SVM,


Problem: Given a vector dataset, no matter how high the dimension is, how do we find
a linear hyperplane (decision boundary) that separates the data?


Solution: It is better to find the hyperplane that maximizes the margin.


If a line is $W \cdot X - b = 0$, the margin lies between the two lines $W \cdot X - b = +1$
and $W \cdot X - b = -1$; the distance between the two parallel lines is $d = \frac{2}{\|W\|}$, as
shown in Fig. 4.21.

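The SVM maximizes the margin of the given samples, as shown in Eq. (4.63),

$$ \max_{W, b} \frac{2}{\|W\|} \tag{4.63} $$

s.t.

$$ y_i (W \cdot X_i - b) \geq 1, \quad i = 1, \ldots, n \tag{4.64} $$

where $y_i \in \{+1, -1\}$ is the class label of the sample $X_i$. Furthermore, the SVM problem
is written as:

$$ \max_{w, b} \frac{1}{\|w\|}, \quad \text{s.t.} \quad y_i (w^{T} x_i + b) \geq 1, \; i = 1, \ldots, n $$

where $\frac{1}{\|w\|}$ is the geometric margin, and $\hat{\gamma} = y (w^{T} x + b)$ is defined as the functional
margin. Maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\frac{1}{2} \|w\|^2$ under the same constraints.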

The SVM is known to generalize well even in high-dimensional spaces under
small training sample conditions. The SVM is a binary classifier which determines
the hyperplane that best separates two classes of information from the infinite number
of hyperplanes that may exist. The optimal separating hyperplane is the one that gives
the largest margin between the two classes. Therefore, we define a Lagrange function
as,

$$ L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (w^{T} x_i + b) - 1 \right) $$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ is the vector of Lagrange multipliers. Hence,

$$ \frac{\partial L(w, b, \alpha)}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{n} \alpha_i x_i y_i $$

and,

$$ \frac{\partial L(w, b, \alpha)}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n} \alpha_i y_i = 0 $$

In the best case, these two classes are linearly separable; however, in most real-world
scenarios, this is not true. To remedy this, the nonlinear SVM was created by
adding a kernel function to the algorithm [7]. This allows the data to be mapped to a
higher-dimensional space in which it becomes linearly separable; as a result,
the SVM is applicable to a wide range of image classification problems (Fig. 4.22).

$$ w = \sum_{i=1}^{n} \alpha_i x_i y_i \tag{4.65} $$

$$ f(x) = w^{T} x + b = \left( \sum_{i=1}^{n} \alpha_i x_i y_i \right)^{T} x + b \tag{4.66} $$

$$ f(x) = \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b \tag{4.67} $$

In higher dimension, we have,

$$ f(x) = \sum_{i=1}^{n} \alpha_i y_i \kappa(x_i, x) + b \tag{4.68} $$

where $\kappa(x_i, x)$ is the kernel function in SVM which maps data points of a
low-dimensional space to the high-dimensional one for classification [21].
Fig. 4.22 SVM mapping from a lower-dimensional space to a higher-dimensional one


Fig. 4.23 Confusion matrix for binary classification

                          Actual class
                          Yes                No
Predicted class    Yes    True Positives     False Positives
                   No     False Negatives    True Negatives


A kernel function is a nonnegative real-valued integrable function satisfying:

$$ \int_{-\infty}^{+\infty} k(u) \, du = 1 \tag{4.69} $$

and,

$$ k(u) = k(-u), \quad u \in (-\infty, +\infty) \tag{4.70} $$

The Gaussian function, sigmoid function, cosine function, logistic function, etc., could
be used as a kernel. For example, the Gaussian kernel is,

$$ k(x_1, x_2) = \exp\left( -\frac{\|x_1 - x_2\|^2}{2 \sigma^2} \right) \tag{4.71} $$
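To make the kernel form of the decision function concrete, the sketch below evaluates $f(x) = \sum_i \alpha_i y_i \kappa(x_i, x) + b$ with the Gaussian kernel of Eq. (4.71). The support vectors, the multipliers `alphas`, and the bias `b` are hand-picked hypothetical values, not the result of training; the function names are our own.

```python
import math

def gaussian_kernel(x1, x2, sigma=1.0):
    """Gaussian kernel of Eq. (4.71)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def decision_function(x, support_vectors, labels, alphas, b, kernel):
    """Kernel SVM decision value; its sign gives the predicted class."""
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

# Hypothetical model: one support vector per class (not obtained by training).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, +1]
alphas = [1.0, 1.0]
b = 0.0

for point in [(1.9, 2.1), (0.1, 0.0)]:
    value = decision_function(point, support_vectors, labels, alphas, b,
                              gaussian_kernel)
    print(point, "positive" if value > 0 else "negative")
```

In a real classifier the multipliers and the bias are obtained by solving the optimization problem above, and only the samples with nonzero multipliers (the support vectors) contribute to the sum.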


In supervised learning, we typically need a labeled training dataset, namely ground
truth. We then split the dataset into positive and negative parts. For example, the
University of California, Irvine (UCI), USA, and the National Institute of Standards
and Technology (NIST) provide such labeled datasets on their Web sites for public
downloading.

In supervised learning, a typical classifier classifies a test dataset into two
classes (Yes or No); comparing the predicted class with the actual class yields a
confusion matrix consisting of true positives (tp), false positives (fp), true negatives
(tn), and false negatives (fn), as shown in Fig. 4.23. A more general confusion matrix
is shown in Fig. 4.24, where $c_{ij}$ is the number of samples predicted to be in
class $i$ that actually belong to class $j$ according to the ground truth.

Fig. 4.24 Confusion matrix for classifying multiple classes


In supervised learning, precision (PR) and recall (RC) are defined as below:

$$ PR = \frac{tp}{tp + fp} \tag{4.72} $$

$$ RC = \frac{tp}{tp + fn} \tag{4.73} $$

where tp, fp, fn, and tn are the true positives (hits in classification), false positives (false
alarms), false negatives (missed classifications), and true negatives (correct rejections),
respectively. Together, tp, fp, fn, and tn show how many of the results
reflect the ground truth exactly. Furthermore, F-measure or F-score (F), G-measure (G),
accuracy (AC), sensitivity (SE), specificity (SP), and miss rate
(MR) are listed as:


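$$ F = \frac{2 \cdot PR \cdot RC}{PR + RC} \tag{4.74} $$
$$ G = \sqrt{PR \cdot RC} \tag{4.75} $$
$$ AC = \frac{tp + tn}{tp + tn + fp + fn} \tag{4.76} $$
$$ SE = \frac{tp}{tp + fn} \tag{4.77} $$
$$ SP = \frac{tn}{tn + fp} \tag{4.78} $$
$$ MR = \frac{fn}{tp + fn} \tag{4.79} $$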


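For example, if we have tp = 10, fp = 20, tn = 30, and fn = 40, then

$$ PR = \frac{tp}{tp + fp} = \frac{10}{10 + 20} = \frac{1}{3} \tag{4.80} $$
$$ RC = \frac{tp}{tp + fn} = \frac{10}{10 + 40} = \frac{1}{5} \tag{4.81} $$
$$ AC = \frac{tp + tn}{tp + tn + fp + fn} = \frac{10 + 30}{10 + 30 + 20 + 40} = \frac{2}{5} \tag{4.82} $$
$$ SE = \frac{tp}{tp + fn} = \frac{10}{10 + 40} = \frac{1}{5} \tag{4.83} $$
$$ SP = \frac{tn}{tn + fp} = \frac{30}{30 + 20} = \frac{3}{5} \tag{4.84} $$
$$ MR = \frac{fn}{tp + fn} = \frac{40}{10 + 40} = \frac{4}{5} \tag{4.85} $$
$$ F = \frac{2 \cdot PR \cdot RC}{PR + RC} = \frac{1}{4} \tag{4.86} $$
$$ G = \sqrt{PR \cdot RC} = \sqrt{\frac{1}{15}} = \frac{\sqrt{15}}{15} \tag{4.87} $$

Note: F is the harmonic mean (average) of recall and precision, and G is the geometric
mean (average).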
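The worked example can be checked with a short script that evaluates Eqs. (4.72)-(4.79) for the same four counts. The function name `classification_metrics` is our own; the sketch assumes that none of the denominators is zero.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Evaluation measures of Eqs. (4.72)-(4.79) from the confusion matrix counts."""
    pr = tp / (tp + fp)   # precision
    rc = tp / (tp + fn)   # recall
    return {
        "PR": pr,
        "RC": rc,
        "F": 2 * pr * rc / (pr + rc),            # harmonic mean of PR and RC
        "G": math.sqrt(pr * rc),                 # geometric mean of PR and RC
        "AC": (tp + tn) / (tp + tn + fp + fn),   # accuracy
        "SE": tp / (tp + fn),                    # sensitivity
        "SP": tn / (tn + fp),                    # specificity
        "MR": fn / (tp + fn),                    # miss rate
    }

metrics = classification_metrics(tp=10, fp=20, tn=30, fn=40)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Note that sensitivity coincides with recall, and the miss rate is its complement, so SE + MR = 1 for any counts.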

A receiver operating characteristic (ROC) curve is a graphical plot that illustrates
the performance of a binary classifier system as its discrimination threshold is varied.
The curve is created by plotting the true positive rate ($0 \leq TPR = \frac{tp}{tp + fn} \leq 1$) against
the false positive rate ($0 \leq FPR = \frac{fp}{fp + tn} \leq 1$) at various threshold settings. ROC is a
comparison of two operating characteristics (TPR and FPR) as the criterion changes.
The area under the curve is abbreviated as AUC. A ROC curve is shown in Fig. 4.25.
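A minimal sketch of how the ROC points and the AUC can be obtained from classifier scores is given below. The scores and labels are hypothetical, the threshold is swept over every distinct score, and the area is approximated with the trapezoidal rule; the names `roc_points` and `auc` are our own.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by thresholding at every distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical classifier scores with ground-truth labels (1 = Yes, 0 = No).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
print(auc(points))
```

An AUC of 1 corresponds to a classifier that ranks every positive sample above every negative one, while 0.5 corresponds to random ranking.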

Commonly used software for data classification includes Weka and the R programming
language, which provide multiple statistical algorithms for classification, clustering,
and regression (Fig. 4.26).

In Weka, the recommended file format is the Attribute-Relation File Format (ARFF),
an American Standard Code for Information Interchange (ASCII) text file
that describes a list of instances sharing a set of attributes.

Figure 4.27 shows an ARFF file opened by using a text editor, where “@RELATION iris”
indicates that the data is related to iris; “@ATTRIBUTE sepallength REAL”
shows that the attribute “sepallength” is real-valued; “@ATTRIBUTE class {Iris-setosa,
Iris-versicolor, Iris-virginica}” tells us that the “class” field holds the labels of the iris,
namely “Iris-setosa,” “Iris-versicolor,” and “Iris-virginica,” and this important
information is applied to algorithm training and testing. “@DATA” indicates that the records
start from the line below. The record “5.1, 3.5, 1.4, 0.2, Iris-setosa” means that the first
record of this table consists of four real numbers and one label. The records are
listed until the end of the file.


Fig. 4.25 A ROC curve


Fig. 4.26 Interface of the software Weka


@RELATION iris

@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa


Fig. 4.27 Format of the ARFF text file
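To illustrate how the parts of an ARFF file fit together, the following sketch reads the header and the records with plain string handling. It is a deliberately simplified reader written for this example (the name `parse_arff` is ours); it ignores quoted names, sparse records, and other features of the full format, and it is not Weka's own loader.

```python
def parse_arff(text):
    """Return (relation, attributes, records) from a simple ARFF text."""
    relation, attributes, data = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        if in_data:
            data.append([field.strip() for field in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].upper()
        if keyword == "@RELATION":
            relation = line.split(None, 1)[1]
        elif keyword == "@ATTRIBUTE":
            _, name, kind = line.split(None, 2)  # kind is REAL or a {...} label set
            attributes.append((name, kind))
        elif keyword == "@DATA":
            in_data = True
    return relation, attributes, data

sample = """@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
"""
relation, attributes, data = parse_arff(sample)
print(relation, len(attributes), data[0])
```

The last attribute is nominal, so its type is the set of allowed labels; the numeric fields are kept as strings here and would be converted to floats before training.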


4.4.2 Artificial Neural Networks (ANN) 


An artificial neural network (ANN) is a biologically inspired model which consists of
neurons and weighted connections [43-46]. As a connectionist model, an ANN is
constituted by its neuronal structure and its training and recall algorithms. Because
the neurons are joined by connections or links, the connection weights are the “memory”
of the ANN system [46-48,85].

A standard model of ANNs is shown in Fig. 4.28. For the inputs $x = (x_1, x_2, \ldots, x_n)$
and the associated weights $w = (w_1, w_2, \ldots, w_n)$, $w_i \geq 0$, we have a summation
function $u = f(x, w) = \sum_{i=1}^{n} (w_i \cdot x_i)$ and the activation function $a = s(u)$ to calculate
the activation level of a neuron; the output function $g$ is used to compute the output
signal value emitted through the output (axon) of the neuron, $o = g(a)$. The output
values of a neuron are within $[0, 1]$.

ANNs are regarded as nonlinear classifiers [20]. For example, the discriminant
function of a three-layer neural network is,

$$ g_k(x) = f\left( \sum_{j=1}^{n_H} w_{kj} \, f\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right) \tag{4.88} $$


Fig. 4.29 Activation functions in artificial neural network


Kolmogorov Theorem. Any continuous function $g(x)$, $x = (x_1, x_2, \ldots, x_d) \in [0, 1]^d$,
$d \geq 2$, could be represented by,

$$ g(x) = \sum_{j=1}^{2n+1} \Xi_j \left( \sum_{i=1}^{d} \psi_{ij}(x_i) \right) \tag{4.89} $$

for properly chosen functions $\Xi_j$ and $\psi_{ij}$, with $n = d$.

From the Kolmogorov theorem, we know that ANNs have the ability to perform nonlinear
classification. Historically, the activation function $s$ in ANNs has adopted the following
functions, which are shown in Fig. 4.29.


• Hard-limited threshold function,

$$ s(u) = \begin{cases} 1, & u \geq \varepsilon \\ 0, & u < \varepsilon \end{cases} \tag{4.90} $$

where $\varepsilon > 0$ is the threshold.

• The linear saturated function,

$$ s(u) = \begin{cases} u, & u \in [0, 1] \\ 0, & u < 0 \\ 1, & u > 1 \end{cases} \tag{4.91} $$

where $u = 0$ and $u = 1$ are the saturation thresholds.

• Sigmoid function,

$$ s(u) = \frac{1}{1 + e^{-u}} \tag{4.92} $$

The sigmoid function is s-shaped; it is monotonically increasing, continuous,
and smooth.

• α-function,

$$ s(u) = \begin{cases} u \cdot e^{1 - \alpha u}, & u > 0 \\ 0, & u \leq 0 \end{cases} \tag{4.93} $$

This function is usually used for the third generation (3G) of neural networks, the
spiking neural network (SNN) [29,43,86], as shown in Fig. 4.30.


Fig. 4.30 Alpha function used in spiking neural network
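The neuron model and three of the activation functions can be sketched in a few lines. The function names are ours, the hard-limited function is assumed to fire when the input reaches the threshold, and the output function $g$ is taken to be the identity for simplicity.

```python
import math

def hard_limit(u, eps):
    """Hard-limited threshold function: fires once u reaches the threshold eps."""
    return 1.0 if u >= eps else 0.0

def linear_saturated(u):
    """Linear between 0 and 1, saturated outside this interval."""
    return min(1.0, max(0.0, u))

def sigmoid(u):
    """S-shaped, monotonically increasing, continuous, and smooth."""
    return 1.0 / (1.0 + math.exp(-u))

def neuron_output(x, w, activation):
    """Weighted sum of the inputs followed by the activation function."""
    u = sum(wi * xi for wi, xi in zip(w, x))
    return activation(u)   # output function g assumed to be the identity

# Hypothetical inputs and weights.
x = [1.0, 2.0]
w = [0.5, 0.25]
print(neuron_output(x, w, sigmoid))
print(neuron_output(x, w, linear_saturated))
print(neuron_output(x, w, lambda u: hard_limit(u, eps=1.5)))
```

All three functions keep the neuron output within [0, 1], which matches the output range stated for the standard neuron model.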


In summary, the model of ANN is an assembly of interconnected nodes and 
weighted links, the output is to sum up each of its input value according to the 
weights of its links. 

Step 1. Initialize the weights (w1, w2, ..., Wn). 
Step 2. Adjust the weights in such a way that the output of ANN is consistent with 
class labels of training examples. 


e Objective function: 


e= : $ (0i = gwi, x)? (4.94) 


I 
where o; is an output, x; is an input, w; is the corresponding weight, and g(-) is the 
training function. In ANNs, the objective function usually is an average related 
to the input and output. Equation (4.94) is quadratic cost or sum squared error. 
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Fig. 4.31 Playground of deep learning 


Other objective functions include cross-entropy cost, exponential cost, Kullback— 
Leibler (KL) divergence, etc. [12,28]. 
e Weights: 
The goal of ANN computation is to find the weights w;, i = 1,2,...,n that 
minimize the above objective function or cost function in Eq. (4.94), dew) = 0. 
• Algorithms:
The algorithms used to train ANN weights and minimize the objective function
typically include gradient descent, Newton's method, the conjugate gradient method,
quasi-Newton methods, and the Levenberg-Marquardt algorithm. For
example, Newton's method for seeking the weights is,

$$
w_{i+1} = w_i - m \cdot \frac{g(w_i)}{g'(w_i)} \tag{4.95}
$$

where $m$ is a factor that we call the learning rate.
Gradient descent (i.e., the method of steepest descent) is written as,

$$
w_{i+1} = w_i - m \cdot \nabla g(w_i) \tag{4.96}
$$

where $-\nabla g(w_i)$ is the direction of steepest descent, normal to the level set of $g$ at $w_i$.
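
As a hedged illustration of the steepest descent update, the sketch below minimizes a sum squared error of the form 0.5·Σ(o − x·w)² for a linear model. The linear model, the toy data, the learning rate `m = 0.1`, and the number of steps are assumptions chosen for this example only.

```python
import numpy as np

def gradient_descent(x, o, m=0.1, steps=200):
    """Minimise e(w) = 0.5 * sum((o - x @ w)**2) by steepest descent."""
    w = np.zeros(x.shape[1])          # Step 1: initialise the weights
    for _ in range(steps):            # Step 2: adjust the weights
        residual = x @ w - o          # model output minus target
        grad = x.T @ residual         # gradient of e with respect to w
        w = w - m * grad              # w_{i+1} = w_i - m * grad
    return w

x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
o = x @ np.array([2.0, -1.0])         # targets generated from known weights
w = gradient_descent(x, o)
```

With this data the iteration recovers the generating weights; a learning rate that is too large for the curvature of the cost would make the same loop diverge.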


To help readers understand these concepts well, a software tool called Tinker, based on
TensorFlow, was developed; its interface is shown in Fig. 4.31.

Compared to its peer algorithms, an ANN keeps its weights as memory and satisfies
the requirements of event detection and recognition on spatiotemporal relationships
related to time series analysis [70]. ANNs have the merits to replace FSM, HMM,
and SVM in machine learning and pattern classification [43]. In recent years, ANNs
have exhibited their strength in deep learning [57] and big data analysis [37,86].

NeuCube is a framework for the development of spiking neural network (SNN)
systems in data mining, pattern recognition, and predictive data modeling with
complex and large data, especially spatio- and spectro-temporal data (SSTD) [29,49].
KEDRI at the Auckland University of Technology (AUT), New Zealand, has recently
developed the Neurogenetic Cube system, which has the architecture of a
STDM [48]. It is a sophisticated framework of methods that facilitates efficient
solutions to problems through the careful selection and testing of the most suitable
methods and parameters for a STDM. The deSNN is the most capable of coping
with spatiotemporal data in terms of accuracy [44]. NeuCube is based
on brain-like neural networks and aims at solving problems from the viewpoint
of pattern recognition and classification [38]. NeuCube could be applied to vehicle
monitoring, detection, and object recognition [49].


4.4.3 Deep Learning 


Machine learning (ML) is a type of artificial intelligence (AI) that provides
computers with the ability to learn without being explicitly programmed. The
perceptron, implemented on the IBM 704 in 1957, is said to mark the beginning of the
era of machine learning [66,82]. In the following years, algorithms such as the delta
rule, also called the least mean square (LMS) method [84], the XOR logic function [66],
automatic differentiation (AD) [31], the multilayer perceptron (MLP), decision trees [83],
the support vector machine [10], AdaBoost [23], and random forests [8] came to be
regarded as milestones in the development of machine learning.

Deep learning is thought to have started from the restricted Boltzmann machine (RBM)
in 1986 [20] and was further developed in 1995 by using a convolutional neural
network (CNN) for handwriting recognition, implemented with several rounds of
convolution and subsampling followed by full connections and Gaussian
connections between the network layers [28,57]. In 2006, the deep
belief network (DBN) [32] was successfully developed, which pushed deep
learning research greatly forward.

A famous application of deep learning was AlphaGo [89] from Google DeepMind
in 2016. This computer program for the board game Go (Weiqi) defeated
human professional Go players on the 19 × 19 board. AlphaGo's algorithm is based
on reinforcement learning and comprises (1) a policy network, (2) fast rollout,
(3) a value network, and (4) Monte Carlo tree search (MCTS).

Traditional machine learning is “shallow” rather than “deep”: it is based on a labeled
training dataset and a test dataset, feature extraction through feature engineering (e.g.,
SIFT, HoG, etc.), classifier selection (e.g., SVM, AdaBoost, etc.), and evaluation
of classification results (e.g., precision, recall, ROC, AUC, etc.). A “shallow” (linear)
classifier operating on raw pixels could not possibly distinguish similar-looking objects
of different categories while placing different-looking objects in the same category.
“Shallow” classifiers therefore require a good feature extractor that solves the
selectivity problem; namely, one that is selective to the aspects of
the image that are important for discrimination, but invariant to irrelevant aspects
[42,52,57,93].

Deep learning has a powerful ability of nonlinear processing, using a cascade of
multiple layers for feature transformation and end-to-end learning. A deep learning
architecture is a multilayer stack of simple modules that computes nonlinear
input-output mappings. Deep learning goes through steps such as feature mapping,
pooling, etc. Currently, typical deep learning methods [1,57] include RNNs
(LSTM [33], GRU [34]), R-CNN [26], fast R-CNN [27], faster R-CNN [80],
YOLO, YOLO9000 [78], SSD [62], etc.


• ConvNets (CNNs)
ConvNets (CNNs) rely on local connections, shared weights, pooling, and
the use of many layers; meanwhile, RNNs can be seen as very deep feedforward
networks in which all the layers share the same weights [34,56,57].

In CNNs, $\forall x_1, x_2, \ldots, x_n \in \mathbb{R}^{+}$,

$$
y = W^{\top} h \tag{4.97}
$$

$$
h^{(k)} = g\left(W^{(k)} x + b^{(k)}\right) \tag{4.98}
$$
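where $g(\cdot)$ could be:

• ReLU function: $g(z) = \max(0, z)$
• Logistic function: $g(z) = \frac{1}{1 + e^{-z}}$
• Tanh function: $g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
• Sigmoid function: $g(y) = \frac{1}{1 + e^{-y}}$
• Softmax function: $g(y_i) = \mathrm{softmax}(y_i) = \frac{\exp(y_i)}{\sum_{j} \exp(y_j)}$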


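Given $(h^{(k)}_{i,j})_{m \times m}$ at level $k$, where $g(\cdot)$ is a nonlinear function, a
convolutional operation (3 × 3 or 7 × 7) is,

$$
h^{(k+1)}_{i,j} = g\Big( \sum_{u,v} w^{(k)}_{u,v} \, h^{(k)}_{i+u,\,j+v} \Big) \tag{4.99}
$$

For average pooling with downsampling,

$$
h^{(k+1)}_{i,j} = \frac{1}{4} \left( a^{(k)} h^{(k)}_{2i,2j} + b^{(k)} h^{(k)}_{2i+1,2j} + c^{(k)} h^{(k)}_{2i,2j+1} + d^{(k)} h^{(k)}_{2i+1,2j+1} \right) \tag{4.100}
$$

For max pooling with downsampling,

$$
h^{(k+1)}_{i,j} = \max\left( a^{(k)} h^{(k)}_{2i,2j},\ b^{(k)} h^{(k)}_{2i+1,2j},\ c^{(k)} h^{(k)}_{2i,2j+1},\ d^{(k)} h^{(k)}_{2i+1,2j+1} \right) \tag{4.101}
$$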
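
The convolution and pooling steps can be illustrated with a small NumPy sketch. It is an approximation under stated assumptions: the kernel slides without padding (a "valid" convolution in cross-correlation form), ReLU is used as the nonlinearity, and the pooling coefficients a, b, c, d are all set to 1. The function names `conv2d_valid` and `pool2x2` are hypothetical.

```python
import numpy as np

def conv2d_valid(h, w):
    """Slide kernel w over feature map h without padding, then apply ReLU."""
    kh, kw = w.shape
    rows, cols = h.shape[0] - kh + 1, h.shape[1] - kw + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(w * h[i:i + kh, j:j + kw])
    return np.maximum(out, 0.0)

def pool2x2(h, mode="max"):
    """2 x 2 pooling with stride 2, i.e., downsampling by a factor of two."""
    r, c = h.shape[0] // 2 * 2, h.shape[1] // 2 * 2
    blocks = h[:r, :c].reshape(r // 2, 2, c // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

h = np.arange(16.0).reshape(4, 4)
feature_map = conv2d_valid(h, np.ones((3, 3)))
pooled_max = pool2x2(h, "max")
pooled_avg = pool2x2(h, "avg")
```

Both pooling variants halve each spatial dimension; max pooling keeps the strongest response in every 2 × 2 block, while average pooling smooths it.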


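A loss function is,

$$
J(\theta) = -\mathbb{E} \log p_{\theta}(y \mid x) \tag{4.102}
$$

or,

$$
J(\theta) = \frac{1}{2} \, \mathbb{E}\left[ \lVert y - f(x; \theta) \rVert^{2} \right] \tag{4.103}
$$

where $\mathbb{E}(\cdot)$ is the expectation,

$$
\theta^{*} = \arg\min_{\theta} J(\theta) \tag{4.104}
$$

Hence,

$$
\frac{dJ(\theta)}{d\theta} = 0 \tag{4.105}
$$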


Other loss functions include:

• Square loss function: $L(Y, f(X)) = (Y - f(X))^2$
• Absolute loss function: $L(Y, f(X)) = |Y - f(X)|$
• Logarithm loss function: $L(Y, p(Y \mid X)) = -\log p(Y \mid X)$
• Average loss function: $L = \frac{1}{m} \sum_{i=1}^{m} L(y_i, f(x_i))$, where the set
  $T = \{(x_i, y_i)\}$ $(i = 1, 2, \ldots, m)$ is the training dataset.
• 0-1 loss function: $L(Y, f(X)) = \begin{cases} 1, & Y \ne f(X) \\ 0, & Y = f(X) \end{cases}$
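
The loss functions listed above translate directly into NumPy. The sketch below is illustrative only; the function names and the toy labels `y` and predictions `f` are assumptions introduced for this example.

```python
import numpy as np

def zero_one_loss(y, f):
    """1 where the prediction differs from the label, 0 where it matches."""
    return np.where(y != f, 1.0, 0.0)

def square_loss(y, f):
    return (y - f) ** 2

def absolute_loss(y, f):
    return np.abs(y - f)

def log_loss(p_true):
    """p_true is the probability the model assigns to the observed label."""
    return -np.log(p_true)

def average_loss(loss, y, f):
    """Mean of a pointwise loss over the training set T = {(x_i, y_i)}."""
    return float(np.mean(loss(y, f)))

y = np.array([1.0, 0.0, 1.0, 1.0])   # hypothetical labels
f = np.array([1.0, 1.0, 0.0, 1.0])   # hypothetical predictions
```

For binary labels such as these, the 0-1, square, and absolute losses coincide; they differ as soon as predictions take real values.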


CNNs have been applied to image noise removal [63]. Compared with traditional
image denoising methods such as average filtering, Wiener filtering, and median
filtering, the advantage of a CNN model is that its parameters
can be optimized through network training, whereas the parameters of traditional
denoising algorithms are fixed and cannot be adjusted during
filtering; namely, they lack adaptivity. Meanwhile, the parallel processing ability of
neural networks makes it possible to speed up the image denoising
process.

• Recurrent Networks

RNNs are a family of neural networks for processing sequential data. Most recurrent
networks can process sequences of variable length. For a dynamical system
driven by an external signal $x^{(t)}$, we have

$$
h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta) = g^{(t)}(x^{(t)}, x^{(t-1)}, \ldots, x^{(1)}) \tag{4.106}
$$

where $t = 1, 2, \ldots, T$, and $h$ is the state. Recurrent neural networks produce an output
at each time step and have recurrent connections between hidden units. For
$t = 1, 2, \ldots, T$,

$$
a^{(t)} = b + W h^{(t-1)} + U x^{(t)} \tag{4.107}
$$

$$
h^{(t)} = \tanh(a^{(t)}) \tag{4.108}
$$

$$
o^{(t)} = c + V \cdot h^{(t)} \tag{4.109}
$$

$$
\hat{y}^{(t)} = \mathrm{softmax}(o^{(t)}) \tag{4.110}
$$

where $b$ and $c$ are vectors, and $U$, $V$, $W$ are weight matrices. The loss function $L$ is,

$$
L = \sum_{t=1}^{T} L^{(t)} = -\sum_{t=1}^{T} \log p\left( y^{(t)} \mid \{x^{(1)}, \ldots, x^{(t)}\} \right) \tag{4.111}
$$

Long short-term memory (LSTM) is an RNN model whose short-term memory
can last for a long period of time [24,25]. An LSTM unit consists of a cell and three
gates: an input gate, a forget gate, and an output gate. LSTM is well suited to classifying,
processing, and predicting time series given time lags of unknown size and duration between
important events. The LSTM (memory) cell stores a value (or state) for either long
or short time periods. LSTM gates compute an activation, often using the logistic
function. LSTM was developed to deal with the exploding and vanishing gradient
problem [33].

$$
\begin{aligned}
f_t &= \sigma_g(W_f \cdot x_t + U_f \cdot h_{t-1} + b_f) \\
i_t &= \sigma_g(W_i \cdot x_t + U_i \cdot h_{t-1} + b_i) \\
o_t &= \sigma_g(W_o \cdot x_t + U_o \cdot h_{t-1} + b_o) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \sigma_c(W_c \cdot x_t + U_c \cdot h_{t-1} + b_c) \\
h_t &= o_t \circ \sigma_h(c_t)
\end{aligned}
$$

where $x_t$ and $h_t$ are the input and output vectors, $c_0 = 0$, $h_0 = 0$; $f_t$, $i_t$, and $o_t$ are
the activation vectors of the forget, input, and output gates; $W$, $U$, and $b$ are weight
matrices and bias vectors; $c_t$ is the cell state vector; '$\circ$' is the Hadamard product, namely
$A_{m \times n} \circ B_{m \times n} = (a_{ij})_{m \times n} \circ (b_{ij})_{m \times n} = (a_{ij} \cdot b_{ij})_{m \times n}$;
$\sigma_g(\cdot)$, $\sigma_c(\cdot)$, and $\sigma_h(\cdot)$ are activation functions.
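
A single LSTM time step following the gate equations can be written compactly in NumPy. This sketch assumes the usual choice of the logistic function for the gates and tanh for the cell and output activations; the dictionary keys such as `"Wf"` and the layer sizes are naming conventions invented for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step; p maps names such as 'Wf', 'Uf', 'bf' to arrays."""
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])      # forget gate
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])      # input gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])      # output gate
    c = f * c_prev + i * np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])
    h = o * np.tanh(c)                                         # output vector
    return h, c

rng = np.random.default_rng(1)
n_in, n_hidden = 3, 4
p = {}
for gate in "fioc":
    p["W" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_in))
    p["U" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    p["b" + gate] = np.zeros(n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)   # h_0 = 0, c_0 = 0
for x in rng.normal(size=(6, n_in)):
    h, c = lstm_step(x, h, c, p)
```

The elementwise products `*` are the Hadamard products of the equations; the additive update of `c` is what lets information persist across many time steps.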

The gated recurrent unit (GRU) model is a substantial change to the LSTM: the forget
gate is integrated with the input gate into a single update gate [33].
GRU is a relatively successful variant of LSTM; it has fewer parameters
than a standard LSTM, and the model converges earlier in training. Also, the
GRU does not require an initialization operation to achieve good results. For a fully
gated unit, initially, $t = 0$, $h_0 = 0$,

$$
\begin{aligned}
z_t &= \sigma_g(W_z \cdot x_t + U_z \cdot h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma_g(W_r \cdot x_t + U_r \cdot h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \sigma_h(W_h \cdot x_t + U_h (r_t \circ h_{t-1}) + b_h) && \text{(new memory)} \\
h_t &= (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t && \text{(hidden state)}
\end{aligned}
$$

where $x_t$ and $h_t$ are the input and output vectors; $W$, $U$, and $b$ are weight matrices and
bias vectors; '$\circ$' is the Hadamard product; $\sigma_g(\cdot)$ and $\sigma_h(\cdot)$ are the sigmoid and tanh
activation functions.
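
For comparison with the LSTM, a GRU time step following the four equations above can be sketched as follows. The parameter dictionary layout, the layer sizes, and the random initialization are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU time step; p maps names such as 'Wz', 'Uz', 'bz' to arrays."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])            # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])            # reset gate
    h_new = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])  # new memory
    return (1.0 - z) * h_prev + z * h_new                            # hidden state

rng = np.random.default_rng(2)
n_in, n_hidden = 3, 4
p = {}
for gate in "zrh":
    p["W" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_in))
    p["U" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    p["b" + gate] = np.zeros(n_hidden)

h = np.zeros(n_hidden)                       # h_0 = 0
for x in rng.normal(size=(6, n_in)):
    h = gru_step(x, h, p)
```

The unit needs three groups of weights (z, r, h) where the LSTM needs four, which is where the smaller parameter count comes from.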

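• Recursive Networks

For recursive neural networks,

$$
h^{(t)} = W^{\top} h^{(t-1)} \tag{4.112}
$$

Applying the recursion repeatedly, as in the power method,

$$
h^{(t)} = (W^{t})^{\top} h^{(0)} \tag{4.113}
$$

If $W$ admits an eigendecomposition, then

$$
W = Q \Lambda Q^{\top} \tag{4.114}
$$

If $Q$ is orthogonal, then

$$
h^{(t)} = Q \Lambda^{t} Q^{\top} h^{(0)} \tag{4.115}
$$

where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$, $\lambda_i \ne 0$; when $|\lambda_i| < 1$, the RNN is contractive.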
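
The contraction property can be checked numerically. The sketch below builds a matrix W = QΛQᵀ from an orthogonal Q and eigenvalues of magnitude below 1, iterates the recursion, and compares the result with the closed form implied by the eigendecomposition. The specific matrix, eigenvalues, and initial state are assumptions chosen for this example.

```python
import numpy as np

Q, _ = np.linalg.qr(np.array([[1.0, 2.0], [3.0, 4.0]]))  # an orthogonal matrix
lam = np.array([0.9, 0.5])                               # |lambda_i| < 1
W = Q @ np.diag(lam) @ Q.T                               # W = Q Lambda Q^T

h0 = np.array([1.0, -2.0])
t = 20
h = h0.copy()
for _ in range(t):
    h = W.T @ h                                          # h(t) = W^T h(t-1)

h_closed = Q @ np.diag(lam ** t) @ Q.T @ h0              # eigendecomposition form
```

Because every eigenvalue has magnitude below 1, each component of the state along an eigenvector shrinks geometrically, so the state decays toward zero; eigenvalues above 1 in magnitude would make it explode instead.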


• R-CNN

R-CNN [26] refers to region-based convolutional neural networks, which generate
potential bounding boxes in an image and run a classifier on these proposed boxes.
Post-processing is used to refine the bounding boxes, eliminate duplicate
detections, and rescore the boxes based on other objects in the scene.
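
One common way to eliminate duplicate detections in post-processing is greedy non-maximum suppression (NMS) based on the intersection over union (IoU) of boxes. The sketch below illustrates that technique; it is not taken from the text, and the box format `(x1, y1, x2, y2)`, the threshold of 0.5, and the example boxes are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5):
    """Visit boxes by descending score; keep a box unless it overlaps a kept one."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for k in order:
        if all(iou(boxes[k], boxes[j]) <= threshold for j in keep):
            keep.append(k)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]   # hypothetical detections
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first heavily and is suppressed, while the third box, which does not overlap, is kept.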

The biggest problem of R-CNN [26] is that its training and test times are very
long, because it needs to generate 1000 to 2000 proposals first and save them to disk,
and each proposal has to pass through all the preceding layers, which involves a lot
of repeated computation. In addition, the fully connected layer expects all the vectors
to have the same size, so all the proposals need to be resized by cropping or warping;
neither strategy is ideal, because cropping may extract a proposal only partially,
and warping could change the scale of objects.

Fast R-CNN [27] overcomes several problems of R-CNN and decreases the
consumption of time and space. What fast R-CNN does is replace the last pooling
layer (pooling layer 5) with an ROI pooling layer, and apply the softmax function to
classification. Softmax is an extension of logistic regression to the multiclass
classification problem.

Faster R-CNN [80] was proposed to improve the training speed of fast R-CNN.
From R-CNN to faster R-CNN, the four steps of object detection are finally
unified into one network. Faster R-CNN does not use selective search to get region
proposals; instead, it makes use of a region proposal network to carry out the same
task. There is no repeated computation, and all the calculations are performed on GPUs.

Fig. 4.32 Screenshot of the deep learning software: Caffe


• SSD

The Single Shot MultiBox Detector (SSD) [62] is very similar to faster R-CNN and
simultaneously produces a score for each object category in each box. It skips the
proposal step and predicts bounding boxes and confidences for multiple categories
directly. SSD uses default boxes of different aspect ratios at each feature location
from multiple feature maps at different scales.


• YOLO

YOLO [77] is one of the fast object detectors based on regression; it runs at 45 frames
per second with an mAP of up to 57.9%. YOLO divides an image into an S × S grid of
cells; each cell is responsible for the objects whose centers fall into it, and predicts
bounding boxes together with a confidence score for each box. For evaluating YOLO on
VOC (Pascal VOC datasets), a 7 × 7 grid with two bounding boxes per cell and 20 labeled
classes are defined, which means it only extracts 7 × 7 × 2 = 98 proposals. YOLO is
therefore faster than R-CNN [26], which needs about 2000 proposals. YOLO9000 [78] is a
real-time framework for detecting more than 9000 object categories by jointly
optimizing detection and classification.

YOLO has been applied to flame detection in surveillance [87]. YOLO uses the
whole image instead of region proposals for training and testing. For real-time
object detection, YOLO has an overwhelming advantage. When fire
flames have entirely different color features compared to the training set, shallow
learning may have difficulty detecting them. Deep learning, however, demonstrates
superior performance in this case; it is not influenced by the changes of flames,
thanks to its fine-grained adaptivity.

An example of object classification with deep learning from the Caffe [40] demos is
shown in Fig. 4.32. In this example, a photograph uploaded to the Web site has been
analyzed and explained with semantic concepts.

• Problems of Backpropagation
SGD refers to stochastic gradient descent in deep learning. The loss function of
SGD is $J(\theta) = L(f_{\theta}(x_i), y_i)$, where $(x_i, y_i)$ are samples $(i = 1, \ldots, m)$. To solve
$\frac{\partial J}{\partial \theta} = 0$ with respect to $\theta$,

$$
\theta_j := \theta_j - \alpha \cdot \frac{\partial}{\partial \theta_j} J(\theta); \quad \forall j = 0, 1, 2, \ldots, n; \ \forall i = 1, 2, \ldots, m.
$$

where $\alpha$ is the learning rate.

For global optimization, saddle points exist in the backpropagation
procedure [28]. To deal with these points, the gradient vanishing and gradient exploding
problems have to be overcome. The usual ways to solve these problems are through
hierarchies of networks, restricted Boltzmann machines (RBM), generative models,
long short-term memory (LSTM), and residual networks (ResNets), or by redesigning the
network using gradient clipping, weight regularization, etc.

Regularization is defined as any modification we make to a learning algorithm that
is intended to reduce its generalization error but not its training error. The regularized
objective function is,

$$
\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \cdot \Omega(\theta) \tag{4.116}
$$

where $\alpha \in [0, \infty)$ is a hyperparameter, the regularization rate; $\theta$ denotes all of the
parameters. The optimized parameters $\theta^{*}$ are obtained by using

$$
\theta^{*} = \arg\min_{\theta} \tilde{J}(\theta; X, y) \tag{4.117}
$$

Regularization is helpful to reduce overfitting. $L^2$ regularization and dropout are
two very effective regularization techniques.


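$$
\Omega(\theta) = \frac{1}{2} \lVert w \rVert_2^2 \tag{4.118}
$$

Thus,

$$
\tilde{J}(w; X, y) = J(w; X, y) + \frac{\alpha}{2} w^{\top} w \tag{4.119}
$$

The gradient,

$$
\nabla_w \tilde{J}(w; X, y) = \nabla_w J(w; X, y) + \alpha \cdot w \tag{4.120}
$$

To update the weights,

$$
w \leftarrow w - \epsilon \cdot \nabla_w \tilde{J}(w; X, y), \quad \epsilon \in (0, 1) \tag{4.121}
$$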
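
The effect of the L2 penalty on the weight update can be seen in a small sketch. The quadratic loss J(w) = 0.5·‖w − target‖², the values of `alpha` and `eps`, and the function name `regularized_step` are assumptions introduced for this illustration.

```python
import numpy as np

def regularized_step(w, grad_J, alpha=0.1, eps=0.05):
    """One update w <- w - eps * (grad_J(w) + alpha * w)."""
    return w - eps * (grad_J(w) + alpha * w)

target = np.array([1.0, -2.0])

def grad_J(w):
    return w - target                 # gradient of J(w) = 0.5 * ||w - target||^2

w = np.zeros(2)
for _ in range(2000):
    w = regularized_step(w, grad_J)
```

The iteration settles at target / (1 + alpha) rather than at the target itself: the penalty shrinks the weights toward zero, which is why L2 regularization is also known as weight decay.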
Fig. 4.33 Metadata of a video with audio track

Video
  Length              00:00:16
  Frame width         1920
  Frame height        1080
  Data rate           17590 kbps
  Total bitrate       17719 kbps
  Frame rate          30 frames/second
Audio
  Bit rate            128 kbps
  Channels            2 (stereo)
  Audio sample rate   48 kHz


4.5 Metadata of Surveillance Data 


Metadata is not the data of an object itself but data related to that object, such as
file size, time stamps, attributes, and GPS information. Metadata is not the file content;
it is called data about data. Meta attributes include time stamp, file index, tags, file
size, etc.

For a photograph, we could find the EXIF items, which include camera
ID, camera maker (who made the camera), camera model, exposure time for this
photograph, focal length, ISO speed, flash mode, saturation, sharpness, and the light
source at the moment the photograph was taken. An example of the metadata of a
video with an audio track is shown in Fig. 4.33.


4.6 Questions 


Question 1. What are the key issues in object recognition? 

Question 2. What components are included in a visual feature vector? 
Question 3. Please explain the below concepts in object recognition. 
(1) Training set and test set 

(2) Ground truth 

(3) Detector 

(4) Classifier 

(5) Accuracy, precision, and recall 

(6) Confusion matrix 

(7) ROC curve 

(8) F-measure and G-measure 

Question 4. Please explain k-nearest neighbors algorithm (k-NN). 
Question 5. Please explain the simplest form of Bayes’ rule (theorem). 
Question 6. What are the differences between deep learning and shallow learning? 
Question 7. Why is the Gabor transform important in computer vision?


Biometrics for Surveillance


5.1 Introduction to Biometrics 


Biometrics [63,64] comprises fingerprints [110], hand geometry [25,112], earlobe
geometry [121], retina and iris patterns, voice waves, DNA, signatures, etc. [3,
7,89]. Biometric information is captured from the human body and natural
behaviors such as gait, keystroke, voice, lipreading [118], human signature [114],
etc. In general, everybody has his or her own biometrics, and different individuals
have clearly distinct computable features. These covering and discriminative
attributes are the cornerstone of what we should study in biometrics. In the following,
we will detail each biometric one by one.

Human Face. Face recognition is one of the most important biometrics in
computer vision [14,71]. It has been broadly used in fields such as surveillance,
information security, identification systems, and law enforcement systems. In face
recognition, the system first needs to detect a human face and then recognize the face
using a classifier such as an SVM assisted by a training set [102]. The classical
algorithms comprise principal component analysis (PCA) [149], linear discriminant
analysis (LDA), etc. [8,23,31,33,36,86]. Automated face analytics such as face detection,
face recognition, and facial expression recognition are useful in security and
forensics.

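In the PCA algorithm, we need to calculate eigenvalues [73,132,149],

$$
X^{-1} A X = \Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \cdots, \lambda_n\}
$$

where $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ are the roots of Eq. (5.1), and $X$ is an eigenvector matrix,

$$
f(\lambda) = 0 \tag{5.1}
$$

where $f(\lambda)$ is the characteristic polynomial,

$$
f(\lambda) = \det(A - \lambda I) = \sum_{i=0}^{n} a_i \lambda^{i} \quad (a_n \ne 0)
$$

For example, if a matrix is,

$$
A = \begin{pmatrix} 1 & 3 \\ 0 & 2 \end{pmatrix}
$$

then

$$
f(\lambda) = \det(A - \lambda I) = \begin{vmatrix} 1 - \lambda & 3 \\ 0 & 2 - \lambda \end{vmatrix} = (1 - \lambda)(2 - \lambda) = 0
$$

Hence,

$$
\lambda_1 = 1; \quad \lambda_2 = 2;
$$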

Face recognition is a nonintrusive method, and facial images are probably the most
commonly used biometric characteristic for establishing personal identity. The
applications of face recognition range from static, controlled “mugshot” authentication
to dynamic, uncontrolled face identification in a cluttered background [21,49,132].
For the details of face recognition:

• Location and shape of facial components, such as eyes, eyebrows, nose, lips, and
  chin, and their spatial relationships, are shown in Fig. 5.1 [3,6].

• Security authentication systems require a fixed and simple background or special
  illumination. These systems also have difficulties in matching face images
  captured from two drastically different views. It is questionable whether the human
  face, without any contextual information, is a sufficient basis for recognizing a
  person from a large number of identities with an extremely high level of
  confidence.

• A face recognition system should be automated; an example of human face
  detection based on OpenCV (http://opencv.org/) is shown in Fig. 5.2.


However, as one of the object recognition problems in digital image processing, face
recognition still suffers from problems such as luminance changes, pose changes,
makeup, complex environments, head rotation, and aging [46,106,108].
Most of these problems are still under investigation.

OpenCV is a software platform for computer vision in which human face detection
and facial parts detection have been well developed; trained models are available for
detecting the human mouth, left eye, right eye, nose, etc. [6]. OpenCV,
including source code, can be downloaded from http://opencv.org.

OpenCV adopts the cascade face detection algorithm, which is extremely fast with
efficient feature selection; the face detector is a well-trained, scale- and
location-invariant detector. Instead of scaling the image itself (e.g., pyramid filters,
etc.), it scales the features [135].

Fig. 5.1 Detection of human facial parts using OpenCV

Beyond face detection, the Viola-Jones algorithm could be applied to detect other types
of objects such as cars and hands. The algorithm makes use of Haar feature selection,
creates an integral image as the main feature representation, and adopts the AdaBoost
training algorithm. The salient contrast at the eye region and nose bridge region of a
human face is taken as the feature for AdaBoost training. The integral
image has been applied to multiresolution-based face region search [141]; hence it is
hierarchical. The cascade classifier discards the regions that fail detection; only those
that pass the classification successfully remain for further consideration.

Fig. 5.2 OpenCV for human face detection

Since the detected regions are merged together after the detection, in OpenCV
applications or mobile apps [4], we only see one rectangle for the overlapped regions.

Meanwhile, the Viola-Jones algorithm is most effective only on frontal images
of a face; it can hardly cope with a 45° face rotation around both the vertical and
horizontal axes [145]. It is sensitive to lighting conditions, and we might get multiple
detections of the same face due to overlapping sub-windows [2].
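
The integral image that makes Haar-like features cheap to evaluate can be sketched with NumPy. This is a simplified illustration, not OpenCV's implementation; the function names, the zero-padded layout of the integral image, and the two-rectangle feature (upper half minus lower half) are assumptions made for this example.

```python
import numpy as np

def integral_image(img):
    """ii[r, c] holds the sum of img[:r, :c]; a zero row and column pad the border."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from four lookups in the integral image."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def two_rect_feature(ii, r, c, h, w):
    """Haar-like feature: upper half minus lower half of an h x w window."""
    half = h // 2
    return box_sum(ii, r, c, r + half, c + w) - box_sum(ii, r + half, c, r + h, c + w)

img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
```

Once the integral image is built, every rectangular sum costs four lookups regardless of the rectangle size, which is why features can be rescaled instead of rescaling the image.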

Fingerprint. Humans have used fingerprints for personal identification for a long
time, and the matching accuracy of fingerprints has proved extremely
high [89]. A fingerprint is the pattern of ridges and valleys on the surface of a
fingertip. Fingerprints of identical twins are different, and so are the prints on each
finger of the same person. The accuracy of currently available fingerprint recognition
systems is adequate for security authentication. Multiple fingerprints of a person provide
additional information to allow for large-scale identification involving millions of
identities.

Iris. The iris is the annular region of the eye bounded by the pupil and the sclera on
either side. The visual texture of the iris is formed during fetal development and
stabilizes during the first two years of life [3]. The complicated iris texture carries
very distinctive information for personal recognition. Iris-based recognition systems are
promising and point to the feasibility of large-scale identification systems. Each iris
is believed to be distinctive, like fingerprints; even the irises of identical twins are
expected to be different. It is extremely difficult to surgically tamper with the texture
of an iris. Although the early iris-based recognition systems required considerable
user participation and were expensive, the newest systems have become more
user-friendly and cost-effective. While iris systems have a very low false acceptance rate
(FAR) compared to other biometric traits [129], the false reject rate (FRR) of these
systems could be very high [131].

Keystroke. This behavioral biometric is expected to offer sufficient discriminatory
information to permit identity verification. Keystroke dynamics is behavioral
information; for some individuals, one may expect to observe large variations in
typical typing patterns [62]. Furthermore, the keystrokes of a person could be monitored
unobtrusively as the person is typing. Nevertheless, this biometric permits “continuous
verification” of an individual over a period of time.

Signature. The way a person signs his or her name is known to be characteristic of
that individual. Signatures require contact with the writing instrument and an effort
on the part of the user, yet they have been accepted in government, legal, and commercial
transactions as a method of authentication. Signatures are a behavioral biometric
influenced by the physical and emotional conditions of the signatories. Signatures vary
substantially: even successive impressions of the same signature are significantly
different [114]. Furthermore, professional forgers may be able to reproduce signatures that
fool the system.

Voice. Voice is a combination of physical and behavioral biometrics [93]. The
features of an individual's voice are based on the shape and size of the appendages (e.g.,
vocal tract, mouth, nasal cavities, and lips) that are used in the synthesis of sound.
These physical characteristics of human speech are invariant for an individual, but
the behavioral part of speech changes over time due to aging, medical conditions,
emotional state, etc. Voice is also not very distinctive and may not be appropriate for
large-scale identification. A disadvantage of voice-based recognition is that speech
features are sensitive to a number of factors such as background noise. Speaker
recognition is most appropriate in phone-based applications, but the voice signal over
the phone is typically degraded in quality by digital communication [17].

Lipreading. Lipreading is the process of observing the lip movements of a speaker
with the aim of interpreting speech, especially when recorded voice is not available
or is noisy [29,118]. Lip detection finds the approximate
location of the lips in motion picture sequences; it depends on face detection to
position the lips. Usually, a color image in RGB color space is used to determine where
the lip pixels are and how the lips are moving in the spatiotemporal domain [3].

The features for lipreading comprise contours based on the edges and positions of the
keypoints of moving lips, and the outward shape or texture of the lips. An active
appearance model (AAM) is usually employed to extract the internal and external lip
contours. The AAM comprises 12 points on the internal lip contour and 16 points
on the external lip contour; meanwhile, viseme grouping has been highlighted in
lipreading as well [41].

Lip recognition [41] adopts methods such as template matching, the hidden
Markov model (HMM) [16,150], the time-delayed neural network model, the
self-organizing map neural network (SOMNN) model, and mixed recognition methods.
Because a single recognition method has a low recognition rate, plenty of lipreading
systems mix a variety of methods together so as to improve precision and recall. The
combination of HMM and ANN recognition methods has been proposed to achieve
a better result [67,77,96,127].

Biometrics have unique features that distinguish them from other evidence, and they
have been widely applied to digital surveillance and forensics. Biometrics have been
employed to identify and authenticate after authorization [15]. Identification determines
whether the biometric information is from the same person. After identification, we
authenticate the person's authorization to access the specific domain.

On the other hand, biometrics are individual and personal; therefore, they raise
ethical and privacy problems, and permission must be granted before any public or
official usage [42]. The biometrics of children, women, and the disabled receive
particular emphasis [13].


5.2 Biometrics Characteristics


5.2.1 Measurement and Analysis


Human bodies have different features that are measurable and computable, such
as lip motion, gait and gesture, and body or sign language [5]. These characteristics
help special individuals in special environments, such as the military, a noisy stadium,
or places where quiet is required.

Biometrics have computable features that are applied to find the unique person,
namely identification. Nowadays, most passports carry indispensable biometric
information for border checks at airports or railway stations, which is a popular kind of
authentication. The features of biometrics should be discriminative and covering in
pattern recognition. We observe that biometrics have the following features:

1. Convenience and ubiquity. We carry our biometrics anywhere without special
   requirements. Biometrics are essentially and individually part of the human
   body.

2. Automatically available and measurable. Biometrics are available and measurable
   without harm and can be used for calculation once the features are
   captured.

3. Unalterable, irreplaceable, or unchangeable. Most human biometrics
   accompany us for life after birth. Even as we grow up or
   become old, the biometrics will not be altered.

The pattern of fingerprint minutiae forms a valid representation of a fingerprint [89].
This representation is compact and captures a significant component of the information
in fingerprints; compared to others, minutiae extraction is relatively robust to various
sources of fingerprint degradation. Most types of minutiae in fingerprint images are
stable and can be reliably identified automatically. The most widely used features
are based on ridge endings and ridge bifurcations.

Given a representation scheme (e.g., minutiae distribution) and a similarity
measure (e.g., string matching), there are two approaches for determining the
individuality of fingerprints. In the empirical approach, representative samples of
fingerprints are collected, and a typical fingerprint matcher is used for calculating
similarity; the accuracy of the matcher on the samples provides an indication of the
uniqueness of fingerprints with respect to the matcher. However, there are known
problems and costs associated with the collection of representative samples.

In a police station, if police officers want to search for a person, they usually use the
computable features. Namely, if a person passes by a camera, most of his or her
information will be recorded and stored in a computer for searching in real time.

A majority of applications of biometrics are face detection and recognition. The
results have been used in mobile technology [4]. From an obtained human face, a
computer could infer the person's age [130], ethnicity, gender, skin [72], hairstyle, etc.
If the computer has a watch list, biometric [40] information will help us find the
relevant records for verification and identification [15].

In biometrics, the research challenges include acquisition conditions (illumination,
expression), additional variations (disguises, occlusions), and aging. In an indoor
environment, biometric information is acquired very easily, but outdoors, the
information changes with illumination and weather conditions [99].


5.2.3 Palm Print 


Palm print refers to the area of the palm from the fingertips to the wrist; its features
include wrinkles, ridge endings, triangular points, etc. A palm print has a bigger area
and richer texture, and we could use easily available devices to capture it. Moreover,
using palm prints for people identification is robust and reliable.

Four types of methods are used for palm print analysis,
namely structure-based methods, statistics-based methods, subspace-based methods,
and coding-based methods. In the early stage, structure-based methods mainly used the
direction and position of the principal lines and wrinkles on a palm to achieve
recognition. Statistics-based methods use the mean and variance of palm prints, etc.,
as features to identify the local statistics and global statistics [146]. Subspace-based
methods mainly include independent component analysis (ICA) [86], principal
component analysis (PCA), and linear discriminant analysis (LDA). Coding-based
methods using the Gabor wavelet and Fourier transform have higher accuracy and faster
matching speed, and are regarded as the best among these four [137].

Palm print recognition, as one of the biometric recognition techniques, has been
developed for decades [137,146,147]. Palm print recognition has also
made significant progress, especially in contactless approaches and related
algorithms [151]. The measures for evaluating palm print recognition
are the true acceptance rate (TAR), false rejection rate (FRR), false acceptance
rate (FAR), and discriminating index [7], and also include feature size, feature
extraction, matching time, etc.

The features of palm prints are categorized into geometrical features, principal
lines, wrinkles, delta points, and minutiae. Specifically, geometrical features represent
the characteristics of a palm such as its width and length. Principal lines are the
evident, primary lines in the palm such as the life line, head line, and heart line.
Wrinkles are the other lines in the palm, excluding the principal lines. A delta point
is a special region in the central part of a palm.


5.2.4 Human Face 


A widely used technique for face detection is AdaBoost, which combines weak
classifiers into a strong one for human face detection and recognition. Typical
algorithms for face biometrics are principal component analysis (PCA) [8], linear
discriminant analysis (LDA) [84], Fisher’s linear discriminant [85], and elastic graph
matching (EGM).

AdaBoost is adaptive in that it iteratively reduces misclassifications. The weights of
misclassified samples are increased in each round so that later weak classifiers
concentrate on them; at the end, the given weak classifiers are combined into a
weighted strong classifier.

Linear discriminant analysis (LDA) and the related Fisher’s linear discriminant
are methods used to find a linear combination of features which characterizes or
separates two or more classes of objects. The combination may be used as a linear
classifier or, more commonly, for dimensionality reduction before later classification.
LDA is closely related to the analysis of variance and regression, which also express
one dependent variable as a linear combination of other features or measurements.
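
The linear combination sought by Fisher’s discriminant can be illustrated for two
classes of 2-D feature points. The following is a minimal sketch, assuming the
standard two-class form w = Sw^-1 (m1 - m0); the function names and the midpoint
threshold are illustrative choices, not taken from the cited works.

```python
def fisher_lda_direction(class0, class1):
    """Return w = Sw^-1 (m1 - m0) and the two class means for 2-D points."""
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    def scatter(points, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in points)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
        syy = sum((p[1] - m[1]) ** 2 for p in points)
        return sxx, sxy, syy

    m0, m1 = mean(class0), mean(class1)
    a0, b0, c0 = scatter(class0, m0)
    a1, b1, c1 = scatter(class1, m1)
    a, b, c = a0 + a1, b0 + b1, c0 + c1      # within-class scatter [[a, b], [b, c]]
    det = a * c - b * b
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # Inverse of the symmetric 2x2 scatter matrix applied to (dx, dy).
    w = ((c * dx - b * dy) / det, (-b * dx + a * dy) / det)
    return w, m0, m1


def classify(point, w, m0, m1):
    """Project onto w and threshold at the midpoint of the projected class means."""
    def proj(p):
        return w[0] * p[0] + w[1] * p[1]
    threshold = (proj(m0) + proj(m1)) / 2
    return 1 if proj(point) > threshold else 0


class0 = [(1, 2), (2, 3), (3, 3), (2, 1)]
class1 = [(6, 5), (7, 8), (8, 6), (7, 7)]
w, m0, m1 = fisher_lda_direction(class0, class1)
print(classify((2, 2), w, m0, m1), classify((7, 6), w, m0, m1))
```

The projection onto w reduces each 2-D sample to a single number, which is the
dimensionality reduction role mentioned above.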

In biometrics, PIE refers to pose, illumination, and expression in live recognition.
3D morphable face models are used for pose-independent recognition, while
low-dimensional spherical harmonic representations are used for illumination-insensitive
face recognition. Currently, the Multi-PIE face database from Carnegie Mellon University
in Pittsburgh is the most extensive one in terms of systematic variation of imaging
parameters, including expressions, and is therefore the most suitable dataset. Other
well-known databases include the Yale Face Database, CMU FIA database, MIT CBCL
database, NIST Mugshot Identification Database, etc.
(http://www.face-rec.org/databases/) [106, 108].

For human face detection, two issues should be taken into consideration: (1)
reduction in data dimensionality because of the well-known curse of dimensionality
and (2) feature selection. A typical PCA-based algorithm is listed below:

Step 1: Prepare the data.

Step 2: Subtract the mean of all faces.

Step 3: Calculate the covariance matrix.

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix.

Step 5: Select the principal components.

Step 6: Compare the faces reconstructed using the selected principal components.
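
The six steps above can be sketched with NumPy as follows. This is a minimal
illustration under stated assumptions: each face is already flattened into a row
vector, the function names pca_faces and reconstruct are hypothetical, and the random
array merely stands in for real face data.

```python
import numpy as np


def pca_faces(faces, num_components):
    """faces: (n_samples, n_pixels). Returns mean face, components, and weights."""
    X = np.asarray(faces, dtype=float)                    # Step 1: prepare the data
    mean_face = X.mean(axis=0)                            # Step 2: subtract the mean
    centered = X - mean_face
    cov = np.cov(centered, rowvar=False)                  # Step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)                # Step 4: eigen-decomposition
    order = np.argsort(eigvals)[::-1][:num_components]    # Step 5: largest eigenvalues
    components = eigvecs[:, order].T
    weights = centered @ components.T                     # coordinates in the subspace
    return mean_face, components, weights


def reconstruct(mean_face, components, weights):
    """Step 6: rebuild faces from the selected principal components."""
    return mean_face + weights @ components


rng = np.random.default_rng(0)
X = rng.random((6, 4))                                    # 6 toy "faces" of 4 pixels
mean_face, components, weights = pca_faces(X, 2)
error = np.linalg.norm(X - reconstruct(mean_face, components, weights))
print(components.shape, weights.shape, error)
```

For real images the covariance matrix has one row per pixel and becomes very large;
practical implementations usually avoid forming it explicitly.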


In Viola-Jones face detection, it is observed that the eye region of a human face is
darker than the upper cheeks; this observation is captured by the Haar feature, which
is applied to face detection.

Generally, we have the following four steps for face detection using the Haar 
feature: (1) Haar feature selection, (2) creating integral image, (3) AdaBoost training 
algorithm, and (4) cascaded classifiers. 
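
Steps (1) and (2) can be illustrated with a short sketch: an integral image makes the
sum over any rectangle a matter of four lookups, so a two-rectangle Haar feature costs
a constant number of operations. The function names and the toy image are
assumptions for illustration; AdaBoost training and the cascade are not shown.

```python
def integral_image(img):
    """Return an (h+1) x (w+1) summed-area table with a zero top row and left column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def haar_two_rect_vertical(ii, top, left, height, width):
    """Lower half minus upper half: large when the upper band (eyes) is darker."""
    half = height // 2
    upper = rect_sum(ii, top, left, half, width)
    lower = rect_sum(ii, top + half, left, half, width)
    return lower - upper


# Toy 4 x 4 patch: a dark band (eyes) above a bright band (cheeks).
patch = [[10] * 4, [10] * 4, [200] * 4, [200] * 4]
ii = integral_image(patch)
print(rect_sum(ii, 0, 0, 4, 4), haar_two_rect_vertical(ii, 0, 0, 4, 4))
```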

The advantages of Viola-Jones face detection include (1) extremely fast feature
computation; (2) efficient feature selection; (3) a scale- and location-invariant
detector; (4) scaling of the features instead of the image itself (e.g., pyramid
filters); and (5) applicability to other types of objects such as cars and hands.

The disadvantages of Viola-Jones face detection are: (1) the detector is most
effective only on frontal images of faces; (2) it can hardly cope with a 45° face
rotation around both the vertical and horizontal axes; (3) it is sensitive to lighting
conditions [2].


5.2.5 Gender Profiling 


Although determining gender is a fundamental task for surveillance applications,
normal video algorithms for gender profiling (usually face profiling) have three
drawbacks. (1) The profiling result is always uncertain. (2) For a gender profiling
algorithm running over time, the result is not stable; the degree of certainty usually
varies, sometimes even to the extent that a male is classified as a female and vice
versa. (3) For robust profiling in cases where a person’s face is invisible, other
features, such as body shape, are required. At the very least, these algorithms may
provide different recognition results with different degrees of certainty.

Dempster-Shafer (DS) theory is a popular framework to deal with uncertain or
incomplete information from multiple sources. This theory is capable of modeling
incomplete information through ignorance. For combining different pieces of
information, DS theory distinguishes two cases, i.e., whether a piece of information is
from distinct or nondistinct sources. Therefore, gender profiling results from the same
classifier, e.g., face-based, are considered to be from nondistinct sources, while
profiling results from different classifiers are naturally considered to be from
distinct sources.


5.2.6 EBGM


Elastic bunch graph matching (EBGM) is an algorithm in computer vision for
recognizing objects in an image based on a graph representation. It has been
prominently used in face recognition, but also for gestures and other object
classifications. EBGM is formulated as an optimization problem of two-dimensional
warping that specifies corresponding pixels between the images being compared.

EBGM is an extension of elastic graph matching to object classes with a common
structure, such as faces in identical pose [55]. All instances of such a class are
represented by the same type of graph. From these graphs, a bunch graph of the
same structure is created, with the nodes representing local textures of any object
in the class, e.g., all variants of a left eye, and the edges representing average
distances between the node locations, e.g., the average distance between the two eyes.
Thus, a bunch graph is an abstraction for representing object classes rather than
individual objects.

EBGM can only be applied to objects with a common structure, such as faces in
frontal pose, sharing a common set of landmarks like the tip of a nose or a corner
of an eye [107]. For recognition of arbitrary objects, in the absence of landmarks,
the graphs are required to be dynamic with respect to both shape and attributed
features. Such graph dynamics allow generic object representations, i.e., model or
bunch graphs, to emerge from a collection of arbitrary objects. The idea is to extract
typical local texture arrangements from the objects and provide the rules to compose
them as needed to represent new objects.


5.2.7 Twin Recognition 


Nowadays, biometrics also applies behavioral characteristics to identify individuals.
Among identification tasks, twins are very difficult to identify because they closely
resemble each other [10, 88, 134].

The human visual system can work better than computational machines in twin
recognition. Humans are good at finding facial marks such as moles and scars and can
thus effectively identify twins [126]. In addition, if human observers have more time
to examine the differences carefully, their performance becomes much better [12].

Recently, twin recognition based on human ear features has become a new class of
relatively stable biometrics that has drawn the attention of a number of researchers.
The human ear does not change significantly over time; therefore, it could be regarded
as an effective biometric [30]. Like other pattern recognition methods, ear recognition
has its advantages and disadvantages. Compared to face recognition, ear recognition is
seldom affected by human emotions and aging. Methods of ear recognition fall into two
categories: statistical-based methods and geometric-based methods [75].

Statistical-based methods analyze a human ear image by using statistical
tools [121]. The ear image is treated as a matrix, and a method like PCA is then
applied to extract the features and reduce the redundancy of the data. The related
work also analyzes performance under changes in human aging, illumination, and
pose [78]. In addition, features of human faces and ears are extracted by using PCA.


Fig. 5.3 Results of human behavior recognition

Geometric-based methods make use of the shape of the human ear to identify twins.
There are two convenient methods, the concentric circle method (CCM) and the contour
tracing method (CTM). In CCM, the center point of a human ear is taken as its center
of mass, and concentric circles are created around this point with predefined radii;
the intersections of these circles and the ear edges are regarded as keypoints of the
ear. CTM traces the ear edges so as to find the contour points of bifurcation,
intersection, and ending. In this work, CCM was found to perform better than
CTM [70, 92].


5.2.8 Pedestrians 


Pedestrians [39, 116] are common moving objects on streets, characterized by a human
body and head and by walking on legs. While a person is moving along the street, the
scene and background also keep changing [87].

An example of pedestrian recognition is shown in Fig. 5.3. 

To achieve video classification of human exercising activities (walking, skipping,
running, jumping jacks, and jumping) [48, 56, 69], we consider visual feature extraction
with spatial and temporal information between adjacent video frames. The histogram of
oriented gradients (HOG) is such a visual feature, which is adopted for object
recognition in computer vision. HOG descriptors are reported to provide better
performance than other descriptors for human behavior detection and recognition.

HOG features for human detection are trained and tested after normalization,
gradient computation, and spatial organization [32]. The main idea of the HOG
descriptor is to count the occurrences of gradient orientations in localized portions
of an image. HOG descriptors are implemented by segmenting the image into small
connected regions called “cells.” Each cell generates a histogram of the gradient
orientations (edge directions) of its pixels; the combination of these histograms
forms the descriptor.

The HOG descriptor has the advantage of effectively describing the local shape of an
image. Through the number of bins and the cell size of the histogram, HOG is able to
reflect the character of a local image region, to describe precise feature information,
and to keep features invariant.
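
The per-cell computation can be sketched as below. This is a simplified illustration,
assuming unsigned orientations in [0, 180) degrees, nine bins, central-difference
gradients, and magnitude-weighted votes; block grouping and bin interpolation, which
full HOG implementations use, are omitted.

```python
import math


def cell_histogram(cell, num_bins=9):
    """Histogram of unsigned gradient orientations for one cell (list of rows)."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * num_bins
    bin_width = 180.0 / num_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]      # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]      # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % num_bins] += magnitude
    return hist


def l2_normalize(hist, eps=1e-6):
    """Normalize the histogram so that it is less sensitive to contrast."""
    norm = math.sqrt(sum(v * v for v in hist) + eps)
    return [v / norm for v in hist]


# A 6 x 6 cell containing a vertical edge: all gradient energy falls into bin 0.
cell = [[0, 0, 0, 100, 100, 100] for _ in range(6)]
print(l2_normalize(cell_histogram(cell)))
```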

Local binary pattern (LBP) is an operator that describes the local texture of an
image. The advantages of this operator are rotational invariance and grayscale
invariance. Therefore, LBP is a simple and effective algorithm for feature extraction.
The extraction of LBP depends on the grayscale. A target pixel has eight neighboring
pixels; each neighbor whose grayscale level is greater than that of the central pixel
is marked by ‘1,’ and otherwise by ‘0.’ The resulting eight-bit binary number is the
LBP feature of the central pixel. For LBP feature extraction in practice, the original
image is divided into a number of cells of a fixed size (e.g., 16 × 16). After that,
we apply this method to obtain the eight-bit binary number of each pixel, calculate
the histogram for each cell, and normalize the histograms. Finally, we compose the
feature vector of the whole image by using the histograms as the vector elements.
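
The operator can be sketched in a few lines. The neighbor ordering (clockwise from the
top-left) is an assumed convention, and the comparison follows the text (strictly
greater than the center); many implementations use “greater than or equal” instead.

```python
def lbp_code(img, y, x):
    """Eight-bit LBP code of pixel (y, x); brighter neighbors contribute a 1 bit."""
    center = img[y][x]
    # Clockwise from the top-left neighbor (assumed ordering).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for dy, dx in offsets:
        code = (code << 1) | (1 if img[y + dy][x + dx] > center else 0)
    return code


def lbp_histogram(cell):
    """Normalized 256-bin histogram of LBP codes over the interior pixels of a cell."""
    hist = [0] * 256
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            hist[lbp_code(cell, y, x)] += 1
    total = sum(hist)
    return [v / total for v in hist] if total else hist


patch = [[9, 1, 9],
         [1, 5, 1],
         [9, 1, 9]]
print(format(lbp_code(patch, 1, 1), "08b"))   # corners are brighter than the center
```

The feature vector of a whole image is then the concatenation of lbp_histogram over
all cells.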

In pedestrian detection [153], we usually need to locate object candidates,
provide a specific split function, and outline a post-processing step [116]. Localizing
regions of interest (ROI) means identifying potential candidates for bounding boxes by
passing them through the trees of a random decision forest (RDF).

The classification could yield more than one candidate window around a person in an
image. For tracking and positioning [53], it is meaningful to merge them into one.
Each positive window has a corresponding probability assigned by the strong classifier
generated by the RDF; the one with the highest probability is chosen. Mean shift could
be applied to specify the final detected region of an object [47].


5.2.9 Suspects 


If we put human biometrics such as face, hair, motion, and actions together, a police
station could find a wanted person automatically. On the bulletin of a wanted person
or on social networks, the suspect’s information is usually shown to the public. The
suspect’s features include gender, age, nationality, language, height, weight, color of
eyes, color of hair, length of the hair, eyebrows, nose, mouth, etc. These features
could be calculated from surveillance videos and images by using the algorithms
provided in this chapter. Once the features have been saved in a database, they help us
search for the wanted person in airports, railway stations, and ferry terminals.

As the safety of individuals is always one of the main foci of public security,
tracking suspects under surveillance is crucially significant. However, it is difficult
for suspects to be monitored exclusively by human beings. With the rapid development
of intelligent surveillance technologies, officers have started identifying suspects
and detecting their traces by using intelligent surveillance systems [43]. Specifically,
there are two main parts in suspect identification: human feature extraction and
feature matching for assigned suspects [15]. Well-known techniques related to feature
extraction are feature point detection, edge detection, scale-invariant feature
transform (SIFT), and representation of colors in computer vision [71].


5.3 Human Actions, Behaviors, and Gestures 


Surveillance is a reality; human actions include walking, jogging, running, boxing,
waving, clapping, etc. [60, 142]. Actions and behaviors are closely related to the
timeline and have a life span. We could therefore represent behaviors in a
spatiotemporal volume; trajectories of an object could also be used to represent human
behaviors and actions [74, 94].

Gesture recognition comprises multiple technologies and techniques that arose from
the use of physical input of human body or hand movement [34, 115, 139] and later
moved to identifying gestures [80] through cameras without the need for a physical
input device [5, 11, 24, 95].

There are two types of gesture recognition. One is static gesture recognition, which
recognizes a shape or pattern of a hand [25, 59, 104, 140], a body limb, or a face;
the models used include template matching, neural networks, and pattern recognition
techniques [128].

The other type is dynamic gesture recognition [44, 152]. Here a series of postures is
recognized over a short period of time: each video frame contains a posture, and the
sequence of video frames defines the gesture used to interact with the
computer [79, 125]. Dynamic recognition techniques include time-compressing templates,
dynamic time warping, hidden Markov models (HMM) [16], deep learning [80], conditional
random fields (CRF), time delay neural networks, and particle filtering [58]. A finite
state machine (FSM) could also be set up for detecting human gestures.
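
Of the techniques listed above, dynamic time warping is the easiest to sketch. The
example below aligns two 1-D feature sequences of different lengths and classifies a
query by its nearest labeled template (a nearest-neighbor rule with k = 1). The gesture
labels and sequences are made up for illustration.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible alignment moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify_gesture(query, templates):
    """1-nearest-neighbor over labeled templates given as (label, sequence) pairs."""
    return min(templates, key=lambda t: dtw_distance(query, t[1]))[0]


templates = [("wave", [0, 1, 0, 1, 0]), ("swipe", [0, 1, 2, 3, 4])]
slow_swipe = [0, 0, 1, 2, 2, 3, 4]            # same shape, performed more slowly
print(classify_gesture(slow_swipe, templates))
```

Because the warping absorbs differences in speed, the same template can serve users
who perform a gesture at different paces.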

Human actions and behaviors have mean and variance [60,61]. Mean is available 
from the average of human behavior after repetitions; meanwhile, variances could 
tell us how far each behavior or action is from the mean. 

A typical method of human action and behavior analysis is based on moments. We
calculate the moments of each action class, modeled by a Gaussian distribution (with
diagonal covariance), and then measure the Mahalanobis distance
$d = \sqrt{(x - \mu)^{T} \Sigma^{-1} (x - \mu)}$ from an observation $x$ to every class
with mean $\mu$ and covariance $\Sigma$; finally, we pick the nearest class. Recently,
deep neural networks and sequential tensor decomposition have been applied to human
action recognition [53].
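
A minimal sketch of this nearest-class rule follows, assuming a diagonal covariance so
that the Mahalanobis distance reduces to a variance-weighted Euclidean distance. The
two feature dimensions and the sample values are invented for illustration.

```python
import math


def fit_class(samples):
    """Per-dimension mean and variance (diagonal covariance) of one action class."""
    n, dims = len(samples), len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(dims)]
    var = [sum((s[k] - mean[k]) ** 2 for s in samples) / n for k in range(dims)]
    return mean, var


def mahalanobis_diag(x, mean, var, eps=1e-9):
    """d = sqrt((x - mu)^T Sigma^-1 (x - mu)) with Sigma = diag(var)."""
    return math.sqrt(sum((xi - mi) ** 2 / (vi + eps)
                         for xi, mi, vi in zip(x, mean, var)))


def nearest_class(x, models):
    """models maps a label to (mean, var); the closest class label is returned."""
    return min(models, key=lambda label: mahalanobis_diag(x, *models[label]))


models = {
    "walking": fit_class([(1.0, 0.2), (1.2, 0.1), (0.8, 0.3)]),
    "running": fit_class([(3.0, 1.0), (3.4, 1.2), (2.6, 0.8)]),
}
print(nearest_class((1.1, 0.2), models), nearest_class((3.1, 1.1), models))
```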

Optical flow is based on video motion estimation [45]. In MPEG video, motion vectors
that reflect the optical flow are stored in the compressed stream. Using the motion
vectors, we could estimate object or human motion and track the objects in surveillance
videos [54, 138].

In static gesture recognition, a method made use of dynamic time warping
(DTW) algorithms combined with k-NN classification. This combination was chosen
because it could be implemented with ease and adapted to different users and varying
gesture types [136, 148].

Vision-based gesture recognition has been investigated for a long time; Fujitsu
Laboratories completed the identification of 46 gesture symbols in 1991. With a
brightly colored glove marking the fingertips as the input for finger gesture
recognition, seven gestures can be successfully identified [82].

Static hand recognition of international sign language was implemented by
using principal component analysis (PCA). The method is able to recognize 25
hand gestures [100].

Support vector machine (SVM) is a classical classifier in machine learning. In 
finger gesture recognition, it was usually applied to decide when to start the fin- 
ger tracking by using a range of colors, shapes, and motions. The standard SVM 
algorithm was developed for pattern classification. 

Artificial neural networks (ANNs) simulate the human brain in pattern classification;
the network nodes, deployed on multiple layers, are trained as the necessary elements
for finger gesture recognition [98]. ANNs have the characteristic of pattern
classification that is robust to interference. An ANN for gesture recognition has been
developed as a three-layer neural network, including 13 input nodes, 100 hidden nodes,
and 42 output nodes [40].

A sign gesture recognition system was developed by using recurrent neural net- 
works (RNNs) in deep learning; the system can recognize 42 symbols. 

Since ANN algorithms were put into practice, they have been greatly improved and
generalized, including replacement of the error function and dynamic adjustment of the
network topology, learning rate, and factor parameters. The future development of
ANNs can further reduce the complexity so as to enhance the tractability of ANN
training and the applicability of the algorithms [40, 57, 83].


5.4 Human Privacy and Ethics 


Privacy is the human right to control what happens with personal data [19, 20, 52].
The meaning of privacy may differ across cultures, but the general concept is that
privacy means wanting to keep information unnoticed by or unidentified to the general
public [66, 111].

Surveillance privacy protection (SPP) is a realistic issue [101] in the real world
today [26]. Surveillance data carry confidential or private information related to
personal secrets, children, relatives, location and time, appearance and clothing,
emotions and intentions, etc. [37]. Therefore, surveillance privacy does exist [76].

In data publishing over the World Wide Web, preserving data privacy usually means
releasing more anonymous records, e.g., satisfying k-anonymity; meanwhile, the k
records could be kept l-diverse. The differences among these l-diverse records could
satisfy t-closeness in probability or entropy. Analogously, in surveillance privacy,
the sensitive regions of a picture are usually obscured by mosaics or blurred masks
to enhance anonymity. The masks may be calculated by using pixelization so as to
reach effects such as l-diversity and t-closeness in histogram and entropy in digital
image processing.
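
For tabular records, the first two notions can be checked mechanically. The sketch
below assumes the usual definitions: k-anonymity requires every combination of
quasi-identifier values to be shared by at least k records, and l-diversity requires
every such group to contain at least l distinct sensitive values. The field names and
records are hypothetical.

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, quasi_ids):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[q] for q in quasi_ids)].append(rec)
    return groups


def is_k_anonymous(records, quasi_ids, k):
    """True if every quasi-identifier group holds at least k records."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return all(len(g) >= k for g in groups.values())


def is_l_diverse(records, quasi_ids, sensitive, min_distinct):
    """True if every group holds at least min_distinct different sensitive values."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return all(len({r[sensitive] for r in g}) >= min_distinct
               for g in groups.values())


records = [
    {"age": "20-30", "zone": "A*", "event": "entering"},
    {"age": "20-30", "zone": "A*", "event": "alarming"},
    {"age": "30-40", "zone": "B*", "event": "passing"},
    {"age": "30-40", "zone": "B*", "event": "passing"},
]
print(is_k_anonymous(records, ["age", "zone"], 2),
      is_l_diverse(records, ["age", "zone"], "event", 2))
```

The second group is 2-anonymous but not 2-diverse: both of its records reveal the same
event, which is exactly the weakness l-diversity is meant to expose.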

Surveillance data privacy preservation is different from the censorship systems of
a state [91, 113]. In censorship, a rating system usually classifies movies or TV
dramas, computer games, and literature into several classes: general audiences
(G), parental guidance suggested (PG), restricted (R), etc. Media containing
objectionable, harmful, sensitive, politically incorrect, or inconvenient content will
be banned by a government or an organization [22].

Owing to privacy issues, concern about the conflict between security and privacy
in surveillance is on the rise [120]. Therefore, over the past few years, research has
been carried out to protect privacy in surveillance systems [90]. Various approaches
have been proposed for privacy protection, including distortion filters that pixelize,
blur, black out, silhouette, clear, replace by a generic object, or mask the object
containing sensitive information which may expose privacy [9, 123].

Privacy of surveillance video has been modeled by using the well-known sigmoid
function. The parameters used in this model are the “who,” “when,” and “where”
information, which is why the applications come from surveillance. The detected
pedestrian’s face/head in surveillance video is obscured by encrypting it with a unique
key derived from a master key for privacy preservation. A privacy preservation method
was also proposed that adopts adaptive data transformation, involving selective
obfuscation and global operations, to provide robust privacy. The approach
intelligently hides evidence in a video without much compromise in quality [144].

Private content in surveillance images may include human faces, number plates, street
numbers, and biometrics such as fingerprints and irises. Usually, these images are
considered to contain private information and should be handled with special care in
media sharing [38, 81, 97, 124].


5.4.1 Pixelization and Gaussian Blurring 


Pixelization is a technique for modifying images or videos for privacy; it is
achieved by noticeably lowering the resolution in the ROI, replacing each square block
of pixels with its average [28]. Its primary use is censorship. It is commonly
used in television news to obscure objects containing sensitive information such
as the proper names of people, locations, or any other inappropriate discourse [27].
The advantage of pixelization for privacy protection in surveillance systems is that
it is very simple and easy to integrate into an existing system [27, 28, 123]. On the
other hand, the disadvantage is that the process is irreversible and the private
information is lost [122]. The pixelization of image I(x, y) is shown in Eq. (5.2).


$$ I_p(x, y) = \frac{1}{b^2} \sum_{i=0}^{b-1} \sum_{j=0}^{b-1} I\left(\left\lfloor \frac{x}{b} \right\rfloor b + i,\ \left\lfloor \frac{y}{b} \right\rfloor b + j\right) $$ (5.2)


where the image pixel coordinates are x and y, the block size is b, and
$\lfloor \cdot \rfloor$ indicates floored division.

Gaussian blurring is an approach widely used for privacy protection in
surveillance; it removes details in the ROI by using a Gaussian low-pass filter [9].
It is also used to advantage in edge detection [50]. The Gaussian function in one
dimension is shown in Eq. (5.3):


$$ g(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} $$ (5.3)


where $\mu$ is the mean, $\sigma^2$ is the variance, and $x \sim N(\mu, \sigma^2)$.

In multiple dimensions, it is shown in Eq. (5.4):


$$ G(X) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} e^{-\frac{1}{2} (X - \mu)^{T} \Sigma^{-1} (X - \mu)} $$ (5.4)


where $|\Sigma|$ is the determinant of the covariance matrix $\Sigma$, k is the
dimension of $X = (x_1, x_2, \ldots, x_k)$, and $X \sim N(\mu, \Sigma)$.
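
Both operations can be sketched on a grayscale image stored as a list of rows. The
pixelize function follows Eq. (5.2) under the assumption that the image size is a
multiple of the block size; the blur samples Eq. (5.3) with a zero mean to build a
1-D kernel, applied here to a single row with clamped borders. All function names are
illustrative.

```python
import math


def pixelize(img, b):
    """Eq. (5.2): replace every b x b block by its average value."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            top, left = (y // b) * b, (x // b) * b    # floored division picks the block
            total = sum(img[top + i][left + j] for i in range(b) for j in range(b))
            out[y][x] = total / (b * b)
    return out


def gaussian_kernel_1d(sigma, radius):
    """Samples of Eq. (5.3) with mu = 0, renormalized so the weights sum to 1."""
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [wgt / total for wgt in weights]


def blur_row(row, kernel):
    """Convolve one image row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    n = len(row)
    return [sum(kernel[k + radius] * row[min(max(i + k, 0), n - 1)]
                for k in range(-radius, radius + 1))
            for i in range(n)]


img = [[1, 3, 5, 7], [1, 3, 5, 7], [2, 2, 6, 6], [2, 2, 6, 6]]
print(pixelize(img, 2)[0])
print(blur_row([0, 0, 255, 0, 0], gaussian_kernel_1d(1.0, 2)))
```

A 2-D blur is obtained by applying blur_row to every row and then to every column,
since the Gaussian kernel is separable.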


5.4.2 Audio Progressive Scrambling 


Audio information in surveillance [109], such as human voice, is usually modified
in pitch [65] by manipulating the spectrum in digital signal processing [93]; the
content can be clearly heard, but the speakers cannot be identified; in particular,
the voices of males and females or adults and children are swapped [133]. Progressive
techniques have been used for MP3 encryption [143]. During descrambling, the number of
keys provided, and hence the number of descrambling rounds performed, decides the
audio output quality. With a subset of the keys, the audio is descrambled to a
low-quality wave, whereas an audio clip of original quality can be obtained by using
all of the keys.
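
The round structure can be illustrated with a toy scheme. This is only a sketch under
strong assumptions: each key seeds a pseudo-random permutation of the samples, the
rounds are applied one after another, and descrambling undoes them in reverse order.
It is not the MP3 scheme of [143] and is not cryptographically secure; it merely shows
how holding fewer keys leaves some rounds of scrambling in place.

```python
import random


def _permutation(key, n):
    """Deterministic permutation of range(n) derived from an integer key."""
    idx = list(range(n))
    random.Random(key).shuffle(idx)
    return idx


def scramble(samples, keys):
    """Apply one keyed permutation round per key, in order."""
    out = list(samples)
    for key in keys:
        perm = _permutation(key, len(out))
        out = [out[p] for p in perm]
    return out


def descramble(samples, keys):
    """Undo the rounds belonging to the given keys, last round first."""
    out = list(samples)
    for key in reversed(keys):
        perm = _permutation(key, len(out))
        restored = [0] * len(out)
        for dst, src in enumerate(perm):
            restored[src] = out[dst]
        out = restored
    return out


samples = list(range(16))                 # stand-in for audio samples
keys = [11, 22, 33]
scrambled = scramble(samples, keys)
partial = descramble(scrambled, keys[1:])  # two of three rounds removed
full = descramble(scrambled, keys)         # all rounds removed
print(partial)
print(full == samples)
```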

Typical sounds that may carry private information include footsteps, bath showers,
toilet flushing, baby cooing, signature writing, and keyboard typing or keystrokes.
We take these sounds into consideration and implement progressive scrambling of the
audio clips in the time domain and the frequency domain, respectively. Usually, these
audio clips are considered to contain sensitive information and could easily embarrass
a listener.

Current research results have revealed that new privacy-enabling technologies
(PET) show promise of successfully protecting individual privacy without hindering
surveillance tasks. These results challenge the common assumption that increased
security must come at the cost of privacy [106, 108].

The W³ survived assessment model [117] is essentially the foremost and a very useful
step toward privacy protection of individuals in multicamera video footage [52].
This work sets up directions for future research, for instance, investigating ways
to lessen the privacy loss with the minimum loss of efficacy in video quality [105].


Fig. 5.4 Privacy preservation using: a mosaicking, b Gaussian blurring, c scrambling


5.4.3 Covering and Overlapping 


The traditional ways to preserve privacy in visual surveillance are mosaicking,
pixelization, or scrambling of human face regions as shown in Fig. 5.4 [51]. But
apparently this is not enough, since from the captured clothes, behaviors, or gait,
even from a contour, silhouette, blob, or several dots, we are still able to discern
who the person is. Even if a face is not clearly seen from a very far distance, such
as that of a basketball or soccer player, we are still able to infer who the person is.
Particularly in processed images, photos, or cartoon motion pictures, the similarity
still exists. This motivates the view that the best solution for privacy preservation
is to completely remove the private information from the video frames. However,
doing so will undoubtedly diminish the utility and visibility of surveillance videos.

Utility refers to video usage for various purposes. When an incident happens,
we need to track back and search for the persons and objects related to the incident.
Conventional ways of privacy protection such as image mosaicking, blurring, and
pixelizing easily damage the content; if the surveillance video frames have been
completely obstructed, the video effectively has to be cast aside. Thus, the situation
requires us to find a way to balance the utility, visibility, and privacy of
surveillance videos. Our goal in this section is to resolve this tough problem.

Therefore, we hope to find an effective way of preserving privacy [103]. For
instance, in a typically monitored corridor, we use a walking Mickey Mouse for display
purposes to substitute for a man who is walking through from left to right or from
right to left. As the man passes the site, the mouse is shown in the corresponding
states such as entering, walking, standing, exiting, and alarming. If there is indeed
an incident, namely the alarming state is activated, only the authorized security
staff have the privilege to review the surveillance events; normally, this
analogy-based replacement for the purpose of privacy preservation is a reasonable way
of catering to unauthorized viewers [18].

From our observations, we find that surveillance events have their own patterns
owing to merits such as being discriminative and covering. We therefore have the
opportunity to seek typical motion pictures with a specific pattern, such as cartoon
GIF pictures, which can be played iteratively and are suitable for presenting these
surveillance events. These motion pictures then need to be adjusted to match the needs
of the surveillance events.


Fig. 5.5 A state diagram and video frames show an instance of privacy preservation in
visual surveillance


$$ d(H_1, H_2) = \frac{1}{3L} \sum_{k,l,i} \frac{\left[ h_{1k}^{l}(i) - h_{2k}^{l}(i) \right]^2}{h_{1k}^{l}(i) + h_{2k}^{l}(i)} $$ (5.5)

where an action is represented by a set of nine one-dimensional histograms $h_k^l$
with $k \in \{x, y, t\}$; $h_k^l$ is the histogram of one component of the space-time
measurements $N_k^l$, and L is the total frame number of an image sequence.


$$ (N_x^l, N_y^l, N_t^l) = \frac{(|S_x^l|, |S_y^l|, |S_t^l|)}{\sqrt{(S_x^l)^2 + (S_y^l)^2 + (S_t^l)^2}} $$ (5.6)

where $(N_x^l, N_y^l, N_t^l)$ is the normalization of the spatiotemporal vector
$(|S_x^l|, |S_y^l|, |S_t^l|)$, which indicates the motion of the video frame at time t.

We adopted the videos provided in the surveillance dataset CAVIAR to demonstrate a
walker passing through a shop in a mall. The state diagram with video frames depicts
the typical events of a walker passing through a monitored corridor: entering,
standing, passing, alarming, and exiting. The states could be switched between each
other due to changes in the guard conditions and actions. We detect the state changes;
we find cartoon pictures presenting similar states; finally, the privacy region on the
surveillance video frames is overlapped and the private information of the event is
removed. This example indicates how we could balance human privacy in surveillance
(Fig. 5.5).


Fig. 5.6 Object tracking from left to right with standing state (states: entering,
standing still, exit)


Fig. 5.8 Overlapping the moving object from left to right

As shown in Figs. 5.6, 5.8, 5.9, and 5.10, we detect the moving object, track the 
object, and find the state changes of an event in a surveillance scenario. 
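
The state changes of such an event can be modeled as a small finite state machine
whose current state selects the cartoon to overlay. The states below follow the text;
the trigger names, the transition table, and the file names are hypothetical
placeholders.

```python
# Hypothetical transition table: (current state, trigger) -> next state.
TRANSITIONS = {
    ("entering", "move"): "passing",
    ("entering", "stop"): "standing",
    ("passing", "stop"): "standing",
    ("passing", "leave"): "exiting",
    ("standing", "move"): "passing",
    ("standing", "timeout"): "alarming",
    ("alarming", "move"): "passing",
}

# Placeholder cartoon clips used to cover the privacy region in each state.
CARTOON_FOR_STATE = {
    "entering": "walk.gif",
    "passing": "walk.gif",
    "standing": "stand.gif",
    "alarming": "alarm.gif",
    "exiting": "walk.gif",
}


def next_state(state, trigger):
    """Stay in the current state when no transition is defined for the trigger."""
    return TRANSITIONS.get((state, trigger), state)


def run(triggers, state="entering"):
    """Return the sequence of states visited for a list of detected triggers."""
    states = [state]
    for trigger in triggers:
        state = next_state(state, trigger)
        states.append(state)
    return states


visited = run(["move", "stop", "timeout", "move", "leave"])
print([(s, CARTOON_FOR_STATE[s]) for s in visited])
```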

In videos shown in Figs. 5.6 and 5.9, we detect the “entering,” “standing still,” 
and “exiting” states of the surveillance event [68, 119]; therefore, we could cover the 
moving object using cartoon characters (Fig. 5.6). 

In Fig. 5.7, we collect cartoon pictures from public Web sites showing swinging
the right hand, swinging the left hand, and standing still in virtual reality; the six
cartoons represent the states of two opposite walking directions, left to right and
right to left, through the corridor; the actions of the cartoon characters could
represent the states of surveillance events in reality.


Fig. 5.10 Overlapping the moving object from right to left


5.5 Questions 


Question 1. What are the differences between bioinformatics and biometrics?
Question 2. What are the unique features of biometrics? Why could biometrics be applied to intelligent surveillance?

Question 3. What are the core issues of pattern recognition in biometrics?
Question 4. In your personal experience, which biometric is the most reliable and robust one?


6.1 Definition of an Event


An event is something that happens at a given place and time [27]. This definition has been widely accepted by international communities. An event bridges the gap between cyberspace and the real world. The basic components of an event are the famous “5W,” namely, who, when, where, what, and why. These components have been successfully used in computer vision and artificial intelligence, especially in the computations of observation, learning, presentation, and inference.

An event entity refers to the event ID, time or duration, location, description, etc. Each event is unique; a UUID is assigned to a specific event for the purpose of identification. For an event, the entities time and location belong only to this event; therefore, it is impossible to find two different events with the same time and location, or one event having different locations and times. This is the identity attribute of an event [27].
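The event entity described above can be sketched as a small record type. This is only an illustration: the class name `Event`, its field names, and the helper `same_event` are assumptions and are not prescribed by the text; the UUID comes from Python's standard `uuid` module.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """Hypothetical event entity; the field names are illustrative assumptions."""
    time: datetime       # when the event happens
    location: str        # where the event happens
    description: str     # what happens
    event_id: uuid.UUID = field(default_factory=uuid.uuid4)  # unique event ID


def same_event(a: Event, b: Event) -> bool:
    # Following the text, time and location together identify an event.
    return a.time == b.time and a.location == b.location


e1 = Event(datetime(2024, 1, 1, 8, 0), "Junction 1", "red light on")
e2 = Event(datetime(2024, 1, 1, 8, 0), "Junction 2", "red light on")
print(e1.event_id != e2.event_id, same_event(e1, e2))
```

Here the two events share a time but not a location, so they are treated as different events, and each receives its own UUID.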

In intelligent computation, event computing is usually twofold: detection and exploration [12, 13]. Event detection is based on the patterns found in long-time observations. We have a wealth of events happening every day; for example, traffic control at a junction uses green, yellow, and red traffic lights which are always switched in a fixed sequence. Without doubt, after a green light, the next will be yellow, then red, and so on. The story is shown in Fig. 6.1. Once we realize this occurrence, we find the “5W” and record the story as a normal or abnormal event [16, 22]. If an accident is captured at this intersection, we have detected an abnormal event. Event exploration refers to deriving new events from known ones. For example, if Alice (A) and Bob (B) know how to get to Cindy (C)'s home from their own homes, they could later infer the routes to Alice's home or Bob's home from the current location by using the existing events.

Among the events, a telic event is different from an atelic one. A telic event is one whose time interval has an end point, whereas an atelic event does not have an end point. Meanwhile, atomic events are different from composite ones. An atomic event is an elementary one which cannot be divided into further sub-events; it is the simplest and most fundamental one. A composite event, i.e., a complex event, is defined as the composition of two or more atomic events [24].

Fig. 6.1 The transitions of traffic lights are thought of as events

Event computing has three very important aspects: operating, storing, and mod- 
eling. Event operations refer to unary operations and binary operations. Unary oper- 
ations have only one operating object, such as projection, selection, renaming, etc., 
while binary operations have two objects such as union, concatenation, conditional 
sequence, iteration, aggregation, etc. [27]. 

Event projection maps one event onto another, like vectors in linear algebra. The projection operation is implemented by using the inner product, which indicates how strongly one event impacts the other. Two orthogonal events have little mutual relationship, whereas two vectors pointing in the identical direction have the maximum influence on each other. Mathematically, the cosine function $y = \cos(x)$, $x \in (-\infty, +\infty)$, takes effect as it is used in the inner product of two vectors $V_1$ and $V_2$: $V_1 \cdot V_2 = \|V_1\| \cdot \|V_2\| \cos(\alpha)$, $\alpha \in [0, \pi]$.
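As a small illustration of the inner product and the cosine, the sketch below treats events as plain numeric feature vectors. That representation, and the function names `inner`, `cos_angle`, and `project`, are assumptions made only for this example.

```python
import math


def inner(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def cos_angle(u, v):
    """cos(alpha) = (u . v) / (||u|| * ||v||)."""
    return inner(u, v) / (math.sqrt(inner(u, u)) * math.sqrt(inner(v, v)))


def project(u, v):
    """Projection of u onto v: (u . v / v . v) * v."""
    scale = inner(u, v) / inner(v, v)
    return [scale * b for b in v]


e1, e2, e3 = [1.0, 0.0], [0.0, 2.0], [3.0, 0.0]
print(cos_angle(e1, e2))              # orthogonal events: cosine is zero
print(cos_angle(e1, e3))              # same direction: cosine is one
print(project([2.0, 2.0], [1.0, 0.0]))
```

Orthogonal vectors give a cosine of zero (no mutual influence), while vectors in the same direction give a cosine of one.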

We use event databases and MPEG-7 to store visual events. In the event databases, we define the fields using the attributes of an event, namely, where, who, when, what, and why; the records are used to store each event, and the associated information such as metadata and sensor data is deposited in the related tables [3, 5]. Event search and retrieval are typical operations based on the well-defined SQL language. In SQL, the structure “SELECT <records> FROM <database> WHERE <boolean conditions>” is applied to event search or retrieval from a database.
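A minimal sketch of such an event table and a SELECT ... WHERE query is given below, using Python's built-in `sqlite3` module with an in-memory database. The table name `events`, the column names, and the sample rows are hypothetical; `when_time` and `where_loc` are used because WHEN and WHERE are SQL keywords.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, who TEXT, "
    "when_time TEXT, where_loc TEXT, what TEXT, why TEXT)"
)
rows = [  # illustrative records only
    ("walker", "2024-01-01T08:00", "corridor", "entering", "shopping"),
    ("walker", "2024-01-01T08:01", "corridor", "standing still", "waiting"),
    ("car", "2024-01-01T08:02", "junction", "accident", "red light ignored"),
]
conn.executemany(
    "INSERT INTO events (who, when_time, where_loc, what, why) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)
# Event retrieval: SELECT <records> FROM <table> WHERE <boolean conditions>
found = conn.execute(
    "SELECT what FROM events WHERE where_loc = ? AND who = ? ORDER BY when_time",
    ("corridor", "walker"),
).fetchall()
print(found)
```

The WHERE clause combines two Boolean conditions with AND, so only the walker's corridor events are returned.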

Boolean conditions satisfy the following logic laws, where $\forall A, B, C \in \{T, F\}$, ‘T’ means the Boolean true and ‘F’ refers to false:

Implication law: $A \rightarrow B \equiv \neg A \vee B$.
Idempotent law: $A \wedge A = A$.
Commutative law: $A \wedge B = B \wedge A$.
Associative law: $(A \wedge B) \wedge C = A \wedge (B \wedge C)$.
Absorption law: $A \wedge (A \vee B) = A$.
Distributive law: $A \wedge (B \vee C) = (A \wedge B) \vee (A \wedge C)$.
Double negation law: $\neg\neg A = A$.
De Morgan's law: $\neg(A \wedge B) = \neg A \vee \neg B$.
Excluded middle law: $A \vee \neg A = T$ (and, dually, $A \wedge \neg A = F$).
Identity law: $A \wedge T = A$.
Domination law: $A \wedge F = F$.
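Because each variable takes only two values, these laws can be checked exhaustively with a truth table. The sketch below is one way to do so in Python; the dictionary `LAWS` and the helper `implies` are names introduced only for this illustration.

```python
from itertools import product


def implies(a, b):
    # Truth-table definition: false only when a is true and b is false.
    return not (a and not b)


# Each entry returns True when the law holds for one assignment of (a, b, c).
LAWS = {
    "implication": lambda a, b, c: implies(a, b) == ((not a) or b),
    "idempotent": lambda a, b, c: (a and a) == a,
    "commutative": lambda a, b, c: (a and b) == (b and a),
    "associative": lambda a, b, c: ((a and b) and c) == (a and (b and c)),
    "absorption": lambda a, b, c: (a and (a or b)) == a,
    "distributive": lambda a, b, c: (a and (b or c)) == ((a and b) or (a and c)),
    "double negation": lambda a, b, c: (not (not a)) == a,
    "De Morgan": lambda a, b, c: (not (a and b)) == ((not a) or (not b)),
    "excluded middle": lambda a, b, c: (a or (not a)) is True
    and (a and (not a)) is False,
    "identity": lambda a, b, c: (a and True) == a,
    "domination": lambda a, b, c: (a and False) is False,
}

# Evaluate every law on all 2^3 truth assignments.
results = {
    name: all(law(a, b, c) for a, b, c in product([True, False], repeat=3))
    for name, law in LAWS.items()
}
print(results)
```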


Video Event Representation Language (VERL) is used for event description, where it harnesses structures such as sequence, iteration, alternation, and conditional branches to control the program like a normal programming language. Among the events, temporal relationships have types such as “before,” “begin,” “inside,” “end,” “contain,” “during,” etc. Surveillance events could be presented by using semantic languages in a visual, graphical, spatial, or ontological way [8, 18].

The purpose of event presentation is to convey the main points to an audience; event presentation indeed has computing aspects and needs skills [28]. Good event presentations attract the audience by using natural languages in the way of multimedia [24, 26] such as text, video and audio, diagrams, figures, charts, animation, and graphics [4]. A good example is the interface of YouTube or Gmail, which links all relevant content together with ranking. A very special language for event presentation, in both writing and speech, is mathematics, which is based on symbols and functions in logic. The mathematical symbols include $\forall$, $\exists$, $\neg$, $|$, $\wedge$, and the algebraic system $\langle 0, 1, +, -, \times, \div \rangle$, etc. These symbols have been adopted and fixed for mathematical presentations of modern concepts such as group (e.g., $Z^*$), ring (e.g., polynomial ring), and field (e.g., the real number field, or field of reals), and could not be replaced by others.

An event has six aspects: temporal, spatial, causal, experiential, informational, and structural. The temporal and spatial aspects [14, 23], together denoted as spatial-temporal, are the well-known fundamental aspects of an event. The causal aspect reflects that an event is triggered by various reasons, namely, a consequence is caused by its reasons. “Event driven” means that one or multiple events are automatically triggered by conditions derived from other events, and a load of events could be triggered in sequential order by causal information.




6.2 Event Database 


Surveillance deals with real reality, which is experiential for humans. Nowadays our daily life is being monitored by surveillance cameras anywhere and anytime; therefore, as in the example of traffic lights, we find patterns after long-term observations. We call these patterns events, which are discriminative and covering. We save the events in a database called the event database (eBase), which is easy to implement by using tools such as PHP, MySQL, Ruby, Python, etc.

In event archiving, we also deposit all event sensors into the eBase; each sensor has its own attributes. The sensors connect to servers using IP addresses and GPS information while capturing data such as images, video, and audio; each piece of video or audio footage has its metadata.

An event detector is a piece of software linked to the eBase. The detectors have the ability to link a semantic object with the stories of this object, which are described by using images, videos, or audio clips captured from surveillance cameras or other sensors.

Users of a surveillance system are allowed to provide comments on an event; the comments are stored in the eBase as the evidence of witnesses. In surveillance, witnesses and their comments are regarded as solid evidence.

During eBase creation, the database is protected by a password. The security and privacy of the eBase are generally guaranteed by the system and the password. Without the password, the eBase cannot be accessed; correspondingly, access is controlled. In recent years, data security and privacy of databases have also been investigated from other aspects [1].

When we log into the eBase, we see that the records are stored with the fields of event attributes. Tools and languages such as MySQL, PHP, Python, and even Matlab, which can call and run SQL statements, have the ability to automatically write an event into the eBase as a record. The eBase can be linked to a web page through a web server, so that any update to the database is shown on the web page in a timely manner.


6.3 Event Detection and Recognition 


An event is a semantic story of an object; therefore, it has a duration related to time series. Thus, in event detection and recognition, we usually use methods based on temporal and spatial relationships [11, 12, 14, 17, 23].

Event recognition approaches are twofold: model-based approaches and appearance-based approaches [24]. In the former, Bayesian networks have typically been used to recognize simple events or static postures from video frames; meanwhile, the hidden Markov model (HMM) [21] has also been applied to human behavior recognition [7, 9, 10]. Appearance-based approaches are based on salient regions of local variations in both spatial and temporal dimensions [11, 14, 17]. Boosting is adopted to learn a cascade of filters for visual event detection [26]. In addition, grammar-based and statistics-based methods could also be categorized by the dimension of sampling support, characteristics, and mathematical modeling.

Fig. 6.2 HMM model (time steps T = t-1, T = t, T = t+1)

A Bayesian network is a graphical model for representing conditional independencies between a set of random variables. A Dynamic Bayesian Network (DBN) is based on Bayes' theorem and represents sequences of variables. For example, when we see that the grass is wet, we may infer the reasons from the given conditions, such as the sprinkler or rain. Meanwhile, there are a handful of reasons for rain, and there are other possible causes for the sprinkler as well. One primary reason may trigger the happening of others; hence, these reasons may form a directed graphical probabilistic network, which we call a Bayesian network. Mathematically, a Bayesian network is expressed as Eq. (6.1),


$$P(x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} P(x_i \mid x_{\pi_i}) \tag{6.1}$$


where $\pi_i$ denotes the parents of node $i$ in the graph.
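The factorization of Eq. (6.1) can be tried out on the wet-grass example. In the sketch below the network structure (rain influences the sprinkler, and both influence the wet grass) and all probability values are illustrative assumptions, not values taken from the text.

```python
from itertools import product

# Hypothetical conditional probability tables (illustrative numbers only).
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: {True: 0.01, False: 0.99},    # P(S | R = True)
               False: {True: 0.4, False: 0.6}}     # P(S | R = False)
P_WET = {(True, True): 0.99, (True, False): 0.9,   # P(W = True | S, R)
         (False, True): 0.8, (False, False): 0.0}


def joint(r, s, w):
    """P(R, S, W) = P(R) * P(S | R) * P(W | S, R), the product in Eq. (6.1)."""
    p_w = P_WET[(s, r)] if w else 1.0 - P_WET[(s, r)]
    return P_RAIN[r] * P_SPRINKLER[r][s] * p_w


# The joint distribution sums to one over all assignments.
total = sum(joint(r, s, w) for r, s, w in product([True, False], repeat=3))
# Inference by enumeration: how likely is rain given that the grass is wet?
p_wet = sum(joint(r, s, True) for r, s in product([True, False], repeat=2))
p_rain_given_wet = sum(joint(True, s, True) for s in [True, False]) / p_wet
print(round(total, 6), round(p_rain_given_wet, 4))
```

Enumeration like this is only practical for very small networks, but it shows how each factor depends on a node and its parents.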

The Kalman filter is usually implemented for filtering in continuous-state linear dynamic systems. Most of the time, Kalman filtering is used as a filter in information processing; like other filters, it filters out noise so that only the main part of the processed information is left. The mathematical description of Kalman filtering can be found in Sect. 4.4, related to object tracking [20].

A hidden Markov model (HMM) is a statistical approach in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. An HMM can be presented as the simplest dynamic Bayesian network [6]. As shown in Fig. 6.2, in order to infer the state x(t + 1) at time t + 1, we use the hidden states h(t + 1) and h(t) at times t + 1 and t, respectively. An HMM is designed for predicting discrete state sequences.

In the simplest Markov model, such as a Markov chain [21], the state S is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In an HMM, the state is not directly visible, but the output O, which depends on the state, is observable. Each state has a probability distribution over the possible outputs. Therefore, the sequence generated by an HMM reflects the gradual changes of the state sequence. Note that the word “hidden” refers to the state sequence through which the model passes.

Given an HMM $\lambda = (A, B, \pi)$, the parameters refer to:

State: $S = \{s_1, s_2, \ldots, s_N\}$
Observation: $V = \{v_1, v_2, \ldots, v_M\}$
Transition matrix: $A = (a_{ij})_{N \times N}$, $a_{ij} = p(q_{t+1} = s_j \mid q_t = s_i)$
Emission probabilities: $B = (b_j(m))_{N \times M}$, $b_j(m) = p(O_t = v_m \mid q_t = s_j)$
Initial probabilities: $\pi = (\pi_j)_N$, $\pi_j = p(q_1 = s_j)$
Output: $O = \{O_1 O_2 \ldots O_T\}$
Latent variables: $H = \{H_1 H_2 \ldots H_T\}$

where $q_t$ denotes the hidden state at time $t$.

Given an HMM $\lambda = (A, B, \pi)$ [2], the Forward-Backward procedure answers the question: what is the probability $p(O \mid \lambda)$ of the output $O$ given $\lambda$? The Viterbi algorithm tells us which path $H^* = \arg\max_H p(H \mid O, \lambda)$ is the best one in the unfolded lattice of the HMM, whilst the Baum-Welch algorithm is used to seek $\lambda^* = \arg\max_\lambda p(O \mid \lambda)$.


Therefore, the Baum-Welch algorithm is often used to estimate the parameters of HMMs. Wikipedia provides an example of estimating the initial probabilities and the transition and emission matrices using Python; see https://en.wikipedia.org/wiki/Baum-Welch_algorithm.

The Viterbi algorithm can be visualized by a trellis diagram. The Viterbi path is essentially the shortest path through this trellis. Wikipedia provides an example of finding the most likely sequence of health conditions of a patient after several days' observations; see https://en.wikipedia.org/wiki/Viterbi_algorithm.
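A compact sketch of the Viterbi recursion is given below. The two hidden states, the three observation symbols, and all probabilities are a hypothetical patient model chosen for illustration; the function signature is likewise an assumption and is not taken from any library.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[t][s] = (probability of the best path ending in s at time t, predecessor)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev = best[-1]
        layer = {}
        for s in states:
            # Pick the predecessor r that maximizes the path probability into s.
            p, arg = max((prev[r][0] * trans_p[r][s], r) for r in states)
            layer[s] = (p * emit_p[s][o], arg)
        best.append(layer)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for layer in reversed(best[1:]):
        path.append(layer[path[-1]][1])
    return prob, path[::-1]


# Hypothetical model parameters (illustrative numbers only).
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

prob, path = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, prob)
```

Each layer of `best` corresponds to one column of the trellis, and the stored predecessors are what make the final backtracking possible.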

Conditional random fields (CRFs) are undirected probabilistic models designed for segmenting and labeling sequence data [15, 19]. When we use CRFs for event detection and recognition, we deploy the tasks of those layers; after training and testing, we obtain the parameters and the semantic events from detection and recognition.

In a Markov random field (MRF), the label set $f$ is said to be a CRF, given the observation $d$, if

$$P(f_i \mid d, f_{S - \{i\}}) = P(f_i \mid d, f_{N_i}) \tag{6.2}$$

$$P(f \mid d) = \frac{1}{Z} \exp\left\{ -\sum_{i \in S} V_1(f_i \mid d) - \sum_{i \in S} \sum_{i' \in N_i} V_2(f_i, f_{i'} \mid d) \right\}$$

where $V_1(\cdot)$ and $V_2(\cdot)$ are called association and interaction potentials, respectively.

Deep Markov random field (DMRF) [25] could be applied to texture synthesis, image superresolution, etc. In DMRF, the hidden state $h_u$ and the pixel $x_u$ together form an MRF; each hidden state $h_u$ connects to the neighboring states $h_v$, the neighboring pixels $x_v$, and the pixel at the same location. The dependencies are reflected in the functions $\zeta(x_u, h_u)$, $\phi(h_u, h_v)$, and $\psi(h_u, x_v)$. An MRF is therefore expressed as

$$P(X, H) = \frac{1}{Z} \prod_{u \in V} \zeta(x_u, h_u) \prod_{(u,v) \in E} \phi(h_u, h_v)\, \psi(h_u, x_v)\, \psi(h_v, x_u) \prod_{u \in V} \lambda(h_u) \tag{6.3}$$

where $V$ and $E$ are the sets of vertices and edges of the MRF, respectively, $Z$ is the partition function, and $\lambda(h_u)$ is the regularization function.

$\zeta(x_u, h_u)$ reflects how the pixel values are generated from the hidden states, which is subject to a Gaussian mixture model. $\phi(h_u, h_v)$ and $\psi(h_u, x_v)$ are fully connected.

Table 6.1 An example of Bayes' theorem

Event   Description                                    Probability
A       Someone has disease “C”                        0.02
B       Someone has disease “H”                        0.10
B|A     Someone has disease “H” given disease “C”      0.80
A|B     Someone has disease “C” given disease “H”      0.16

According to the Expectation-Maximization algorithm, the posterior distribution of $h_i$ (in E-steps) and the optimized parameter $\theta$ (in M-steps) could be iteratively calculated:

$$\hat{\theta} = \arg\max_{\theta} \frac{1}{n} \sum_{i=1}^{n} E_{p(h_i \mid x_i, \theta)} \log p(x_i, h_i \mid \theta) \tag{6.4}$$


6.4 Event Classification 


For event classifiers, the basic one is the Bayesian classifier, which is based on the famous Bayes' theorem shown in Eqs. (6.5) and (6.6); Bayes' theorem is remarkable as the cornerstone of modern pattern classification:

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \tag{6.5}$$

$$P(B \mid A) = \frac{P(A \mid B) \cdot P(B)}{P(A)} \tag{6.6}$$

Eqs. (6.5) and (6.6) can be explained as a prior modified by the likelihood relative to the evidence so as to get the posterior:

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}} \tag{6.7}$$

Fundamentally, we obtain the posterior probability from the prior, the likelihood, and the evidence as shown in Eq. (6.7). For example, in Table 6.1,

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} = \frac{0.80 \times 0.02}{0.10} = 0.16$$

$$P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)} = \frac{0.16 \times 0.10}{0.02} = 0.80$$

An application of Bayes' theorem is the Naive Bayesian classifier for classifying documents; it has been applied to spam email filtering:

$$p(D \mid S) = \prod_i p(w_i \mid S) \tag{6.8}$$

$$p(D \mid \neg S) = \prod_i p(w_i \mid \neg S) \tag{6.9}$$

where $D$ is a document for classification, $S$ represents a spam document, and $w_i$ is a word included in the document. Applying Bayes' theorem to this application, the probability that a document is classified as spam or not is given by:

$$p(S \mid D) = \frac{p(D \mid S) \cdot p(S)}{p(D)} = \frac{p(S)}{p(D)} \prod_i p(w_i \mid S) \tag{6.10}$$

$$p(\neg S \mid D) = \frac{p(D \mid \neg S) \cdot p(\neg S)}{p(D)} = \frac{p(\neg S)}{p(D)} \prod_i p(w_i \mid \neg S) \tag{6.11}$$

Now we combine Eqs. (6.10) and (6.11) together,

$$\frac{p(S \mid D)}{p(\neg S \mid D)} = \frac{p(S) \prod_i p(w_i \mid S)}{p(\neg S) \prod_i p(w_i \mid \neg S)} \tag{6.12}$$

For the purpose of simplifying the calculation, we take the logarithm,

$$\ln \frac{p(S \mid D)}{p(\neg S \mid D)} = \ln \frac{p(S)}{p(\neg S)} + \sum_i \ln \frac{p(w_i \mid S)}{p(w_i \mid \neg S)} \tag{6.13}$$

Hence, if $\ln \frac{p(S \mid D)}{p(\neg S \mid D)} > 0$, i.e., $p(S \mid D) > p(\neg S \mid D)$, the document is spam; otherwise it is not.
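The log-ratio of Eq. (6.13) can be sketched in a few lines of Python. The tiny training set, the function names `train` and `log_ratio`, and the use of add-one smoothing (so that unseen words do not produce a zero probability) are assumptions of this illustration and are not part of the derivation above.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (words, is_spam). Returns word counts, class counts, vocabulary."""
    counts = {True: Counter(), False: Counter()}
    n_docs = {True: 0, False: 0}
    for words, is_spam in docs:
        counts[is_spam].update(words)
        n_docs[is_spam] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, n_docs, vocab


def log_ratio(words, counts, n_docs, vocab):
    """ln p(S|D) / p(not S|D) as in Eq. (6.13); add-one smoothing is an assumption."""
    score = math.log(n_docs[True] / n_docs[False])   # ln p(S) / p(not S)
    total_spam = sum(counts[True].values())
    total_ham = sum(counts[False].values())
    for w in words:
        p_w_spam = (counts[True][w] + 1) / (total_spam + len(vocab))
        p_w_ham = (counts[False][w] + 1) / (total_ham + len(vocab))
        score += math.log(p_w_spam / p_w_ham)        # ln p(w|S) / p(w|not S)
    return score


docs = [  # hypothetical training documents
    ("win money now".split(), True),
    ("cheap money offer".split(), True),
    ("meeting agenda today".split(), False),
    ("project meeting notes".split(), False),
]
model = train(docs)
print(log_ratio("win cheap money".split(), *model) > 0)   # positive ratio: spam
print(log_ratio("project agenda".split(), *model) > 0)    # negative ratio: not spam
```

Working with sums of logarithms rather than products of probabilities also avoids numerical underflow for long documents.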


6.5 Event Clustering 


Clustering is regarded as the following procedure: given a set of data points, the data are grouped into several clusters so that, within one cluster, the data points are more similar to one another, while data points in different clusters are less similar. Usually a distance between two vectors, known as a similarity measurement, is used for clustering; the typical measures $d = |X - Y|$ between two vectors $X = (x_1, \ldots, x_N)$ and $Y = (y_1, \ldots, y_N)$ include:

• Euclidean distance,

$$d = \left( \sum_{i=1}^{N} (x_i - y_i)^2 \right)^{\frac{1}{2}}$$

• Minkowski distance,

$$d = \left( \sum_{i=1}^{N} |x_i - y_i|^p \right)^{\frac{1}{p}}$$

• Chebyshev distance,

$$d = \lim_{p \to \infty} \left( \sum_{i=1}^{N} |x_i - y_i|^p \right)^{\frac{1}{p}} = \max_i |x_i - y_i|$$

• Manhattan distance,

$$d = \sum_{i=1}^{N} |x_i - y_i|$$

• Mahalanobis distance,

$$d^2 = (X - Y)^T \Sigma^{-1} (X - Y)$$

where $x_i$ and $y_i$, $i = 1, 2, \ldots, N$, are the components of the data points and $\Sigma$ is the covariance matrix of the data. In clustering, two typical clustering approaches are:


• Partitional clustering: a division groups data objects into nonoverlapping subsets so that each data object is in exactly one subset.

• Hierarchical clustering: a set of nested clusters is organized as a hierarchical tree.
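The distance measures listed earlier in this section can be compared numerically with a short sketch. The function names are chosen for illustration only; the Mahalanobis distance is left out because it additionally needs the inverse of a covariance matrix.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def manhattan(x, y):
    """Minkowski distance with p = 1."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Minkowski distance with p = 2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def chebyshev(x, y):
    """Limit of the Minkowski distance as p grows without bound."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (0.0, 0.0), (3.0, 4.0)
print(manhattan(x, y), euclidean(x, y), chebyshev(x, y))
print(minkowski(x, y, 50))  # already very close to the Chebyshev distance
```

For the same pair of points the Manhattan distance is the largest and the Chebyshev distance the smallest, with the Euclidean distance in between.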


k-means clustering is a kind of partitional clustering approach. In k-means clustering, (1) each cluster is associated with a centroid (center point); (2) each point is assigned to the cluster with the closest centroid; (3) the number of clusters K must be specified. The k-means procedure is shown in Algorithm 3.
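The three points above translate into a short loop. The following is a minimal sketch and not a reference implementation: it assumes that the K initial centroids are passed in by the caller, uses the Euclidean distance via `math.dist` (Python 3.8 or later), and keeps an empty cluster's centroid unchanged.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain k-means on tuples; `centroids` holds the K initial centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # (1) Assign every point to the cluster with the closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[k].append(p)
        # (2) Recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # (3) Stop as soon as no centroid changes.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centroids, clusters)
```

In practice the result depends on the initial centroids, so the algorithm is often run several times with different starting points.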


Input: Data points;
Output: Clusters;
Initialization;
Select K points as the initial centroids;
while centroids keep changing do
    (1) Form K clusters by assigning all points to the closest centroid;
    (2) Recompute the centroid of each cluster;
    (3) Find the difference between the new and the previous centroid of each cluster;
end
Algorithm 3: K-means clustering algorithm


6.6 Questions


Question 1. What is an event? What are the components of an event?


Question 2. What is event computing? What are the event operations?


Question 3. How are visual events stored in MPEG-7?


Question 4. What is k-means clustering?


Visual Event Computing II


7.1 Event Search and Retrieval 


After events are stored in the event database (eBase), we could create an index 
based on these records. The index plays a dominant role and saves the time of 
information retrieval [1,2]. This has simplified and hastened the relevant operations 
and processing, especially those over the Internet. 


7.1.1 Lexical Search and Retrieval 


On the site of distributed Web servers, the index for search and retrieval needs tokenization, namely, segmenting the input text into words after dropping stop words such as articles (“a,” “an,” and “the”), adjectives (e.g., “red,” “blue,” “yellow,” “cold,” “hot,” etc.), prepositions (e.g., “of,” “when,” “where,” etc.), and pronouns (e.g., “I,” “you,” “she,” “he,” “her,” etc.) [1].

In linguistics, stop words are the ones which are filtered out before or after natural language processing (of text). Stop words usually occur with very high frequency but carry very little meaning, such as the words “the,” “is,” “at,” “which,” and so on. Stop words cause issues which may affect retrieval when searching for phrases that include them, particularly in names such as “The Who” or “Take That.” Some search engines remove the most common words, including lexical ones such as “want,” from a query in order to improve performance.

The general strategy for determining a stop list is to sort the terms by collection frequency, which is the total number of times each term appears in the document collection, and then to take the most frequent terms, filtered for their semantic content relative to the documents being indexed, as a stop list. The members of this list are discarded during indexing. An example of a stop-word list is “a,” “an,” “and,” “the.” Using the list significantly reduces the number of postings that a system has to store.

In linguistic computations, the textual normalization process means to capitalize words or convert them to lower case so that, after processing, the words are in a uniform form. For example, the word “student” will be capitalized as “Student,” and so forth [1].

The stemming operation means to find the roots of words such as “fishing,” “fisher,” and “fished” by removing the useless strings, so that only the word “fish” is used for indexing. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly, and often includes the removal of derivational affixes.

In natural language processing (NLP), lemmatization is to get the semantic roots of each word, for example “good” for “better” and “best.” Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma [11]. The linguistic processing for stemming or lemmatization is often tackled by an additional plug-in component to the indexing process, and a number of such components exist in open source.
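The preprocessing steps above (tokenization, normalization, stop-word removal, and stemming) can be chained as in the sketch below. The stop-word list and the suffix list are deliberately tiny and hypothetical, normalization to lower case is one of the two options mentioned above, and the suffix chopping is only a crude stand-in for a real stemmer.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "is", "at", "which"}  # illustrative list
SUFFIXES = ("ing", "ed", "er", "s")                          # crude, hypothetical list


def tokenize(text):
    # Normalize to lower case, then split on anything that is not a letter.
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    # Crude heuristic: chop the first matching suffix, keeping at least 3 letters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stop words, and stem the remaining words."""
    return [stem(w) for w in tokenize(text) if w not in STOP_WORDS]


print(index_terms("The fisher fished and is fishing at the lake"))
```

A heuristic of this kind makes mistakes on many words, which is why dictionary-based lemmatization is preferred when accuracy matters.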

In computer science, an inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database, in an XML file, or in a set of documents. The purpose of an inverted index is to allow fast full-text search, at the cost of increased processing when a document is added to the database. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems. An inverted index uses the position of each word in the sentences of an article to find a sentence in context. Given a keyword sequence, the relevant sentences are found at the intersection of the positions of all the words. Intersecting the posting lists leads to the search results [11].

For example, suppose we have sentence 0, “what is it,” sentence 1, “it is a banana,” and sentence 2, “it is”; then we have the posting lists “a” = {1}, “banana” = {1}, “is” = {0, 1, 2}, “it” = {0, 1, 2}, and “what” = {0}. The sentence “what is it?” could be found from the intersection {0} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0}.
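The three-sentence example can be reproduced with a record-level inverted index built from Python sets. The function names `build_index` and `search` are chosen for illustration, and the handling of the question mark is a simplification.

```python
from collections import defaultdict


def build_index(sentences):
    """Map every word to the set of sentence numbers that contain it."""
    index = defaultdict(set)
    for pos, sentence in enumerate(sentences):
        for word in sentence.lower().replace("?", "").split():
            index[word].add(pos)
    return index


def search(index, query):
    """Intersect the posting lists of all query words."""
    postings = [index.get(w, set()) for w in query.lower().replace("?", "").split()]
    return set.intersection(*postings) if postings else set()


sentences = ["what is it", "it is a banana", "it is"]
index = build_index(sentences)
print(dict(index))
print(search(index, "what is it?"))
```

A word-level index would store the positions of each word inside the sentence as well, which is what phrase search requires.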

There are two main variants of inverted indices. A record-level inverted index (or 
inverted file index or just inverted file) contains a list of references to documents 
for each word. A word-level inverted index (or fully inverted index or inverted list) 
additionally contains the positions of each word within a document. The latter form 
offers more functionality (like phrase searches) but needs more time and space to be 
processed. 

The inverted index data structure is a central component of a typical search engine indexing algorithm. The goal of a search engine is to optimize the speed of the query: finding the documents where a word occurs. Once a forward index, which stores the list of words per document, is developed, the next step is to develop an inverted index. Querying the forward index would require sequential iteration through each document and each word to verify a matching document. The time, memory, and processing resources to perform such a query are not always technically realistic. Instead of listing the words per document as in the forward index, the inverted index lists the documents per word.

Table 7.1 An example of tf-idf with tf, df, and idf of the terms

Terms            tf (doc1)   tf (doc2)   tf (doc3)   ...   df      idf
“Algorithm”      4,250       3,400       5,100       ...   850     0.07
“The”            50,000      43,000      55,000      ...   1,000   0.00
“Book”           7,600       4,000       2,000       ...   400     0.40
“Surveillance”   600         0           25          ...   25      1.6

Table 7.2 An example of tf-idf with the scores and ranks of terms

Terms            tf_idf (doc1)   tf_idf (doc2)   tf_idf (doc3)
“Algorithm”
“The”
“Book”           3,024.34        1,591.76        795.88
“Surveillance”   961.24          0.00            40.05
Scores           4,285.57        1,831.74        1,195.89
Ranking          1               2               3

For keyword-based search, we also use the score of inverse document frequency based on the index and Eqs. (7.1) and (7.2) for ranking, where tf is the term frequency, idf refers to the inverse document frequency, df represents the document frequency, N is the number of documents, d is a document, and q is the keyword-based query [11].

$$idf = \ln\left(\frac{N}{df + 1}\right) \tag{7.1}$$

$$tf\_idf = tf \times idf \tag{7.2}$$

The score for a query q is calculated as

$$S(q, d) = \sum_{t \in q} tf\_idf_{t,d} \tag{7.3}$$

The queried documents will be listed according to the sum of the scores in the ranking. An example of four terms in three documents of a fictional collection with N = 1,000 is shown in Tables 7.1 and 7.2.
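The scoring can be replayed on the numbers of Table 7.1. One caveat is stated as an assumption: the idf column of the table appears to correspond to the base-10 logarithm of N / df, so that variant is used in the sketch below; a different logarithm base or the df + 1 smoothing changes the individual values.

```python
import math

N = 1000  # size of the fictional collection
df = {"algorithm": 850, "the": 1000, "book": 400, "surveillance": 25}
tf = {
    "doc1": {"algorithm": 4250, "the": 50000, "book": 7600, "surveillance": 600},
    "doc2": {"algorithm": 3400, "the": 43000, "book": 4000, "surveillance": 0},
    "doc3": {"algorithm": 5100, "the": 55000, "book": 2000, "surveillance": 25},
}


def idf(term):
    # Assumption: log10(N / df), which matches the idf column of Table 7.1.
    return math.log10(N / df[term])


def score(query, doc):
    """S(q, d): sum of tf * idf over the query terms."""
    return sum(tf[doc][t] * idf(t) for t in query)


query = ["algorithm", "the", "book", "surveillance"]
scores = {doc: score(query, doc) for doc in tf}
ranking = sorted(scores, key=scores.get, reverse=True)
print({d: round(s, 2) for d, s in scores.items()}, ranking)
```

The very frequent term “the” contributes nothing to any score because its idf is zero, which is exactly the behavior a stop word should have.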


7.1.2 Global and Local Search 


In computational intelligence, local and global search includes breadth-first search, depth-first search, depth-limited search, bidirectional search, etc. [5, 14].

7.1.2.1 Breadth-First Search (BFS)

Breadth-first search goes through the tree by levels: it traverses all of the nodes on the top level first, then those on the second level, and so on. This strategy has the benefit of being complete, and it is optimal as long as the shallowest solution is the best one. However, breadth-first search works by keeping all of the leaf nodes in memory, which requires a prohibitive amount of memory when searching anything more than a very small tree.

Mathematically, let $G = (V, E)$ be a graph with $n$ vertices, $|V| = n$; $N(v)$ is the set of neighbors of $v$. Let $\sigma = (v_1, \ldots, v_n)$ be an enumeration of the vertices of $V$. The enumeration $\sigma$ is said to be a BFS ordering if, for all $1 < i \le n$, $v_i$ is the vertex $w \in V \setminus \{v_1, \ldots, v_{i-1}\}$ such that $\nu_{(v_1, \ldots, v_{i-1})}(w)$ is minimal, where $\nu_{(v_1, \ldots, v_{i-1})}(w)$ is the least index $j$ such that $v_j$ is a neighbor of $w$. BFS therefore is iterative.

Breadth-first search is useful when


• Space is not a problem;

• The solution containing the fewest arcs is to be found;

• Few solutions may exist, and at least one has a short path length;

• Infinite paths may exist, because it explores all of the search space, even with infinite paths.

7.1.2.2 Depth-First Search (DFS) 

Depth-first search goes through the tree by branches, going all the way down to the leaf nodes at the bottom of the tree before trying the next branch over. This strategy requires much less memory than breadth-first search, since it only needs to store a single path from the root of the tree down to the leaf node. However, it is potentially incomplete, since it will keep going down one branch until it finds an end, and it is non-optimal: if there is a solution at the fourth level in the first branch tried and a solution at the second level in the next one, the solution at the fourth level will be returned.

In depth-first search, the frontier acts like a last-in first-out stack. The elements are added to the stack one at a time. The one selected and taken off the frontier at any time is the last element that was added. Implementing the frontier as a stack results in paths being pursued in a depth-first manner, i.e., searching one path to its completion before trying an alternative path.

Because depth-first search is sensitive to the order in which the neighbors are added to the frontier, this ordering must be chosen with care. The ordering can be done statically, or dynamically, where the ordering of the neighbors depends on the goal.

Again, let $G = (V, E)$ be a graph with $n$ vertices; $\sigma = (v_1, \ldots, v_n)$ is an enumeration of the vertices of $V$. The enumeration $\sigma$ is said to be a DFS ordering if, for all $1 < i \le n$, $v_i$ is the vertex $w \in V \setminus \{v_1, \ldots, v_{i-1}\}$ such that $\nu_{(v_1, \ldots, v_{i-1})}(w)$ is maximal, where $\nu_{(v_1, \ldots, v_{i-1})}(w)$ is here the greatest index $j$ such that $v_j$ is a neighbor of $w$.

Depth-first search is appropriate when


• The space is restricted;

• Many solutions exist, particularly for the case where nearly all paths lead to a solution;

• The order in which the neighbors of a node are added to the stack can be tuned so that solutions are found on the first try.


Depth-first search is the basis for a number of other algorithms, such as iterative 
deepening. This algorithm does not specify the order in which the neighbors are 
added to the stack that represents the frontier. The efficiency of the algorithm is 
sensitive to the ordering. 

Depth-limited search essentially conducts a depth-first search with a cutoff at a specified depth limit. When the search hits a node at that depth, it stops going down the branch and moves over to the next one. This avoids the potential issue of depth-first search going down one branch indefinitely. However, depth-limited search is incomplete: if there is a solution only at a level deeper than the limit, it will not be found.

Iterative deepening search performs repeated depth-limited searches, starting with a limit of zero and incrementing the limit by one each time. As a result, it has the space-saving benefits of depth-first search but is also complete and optimal, because it visits all the nodes on the same level before continuing to the next level in the next round, when the depth is incremented.
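Depth-limited search and iterative deepening can be sketched together, since the latter simply calls the former with growing limits. The graph, the function names, and the default `max_depth` are assumptions made for this illustration.

```python
def depth_limited(graph, node, goal, limit, path=None):
    """Depth-first search that stops descending below `limit` arcs."""
    path = (path or []) + [node]
    if node == goal:
        return path
    if limit == 0:
        return None                       # cutoff reached on this branch
    for neighbor in graph.get(node, []):
        if neighbor not in path:          # avoid cycling along the current path
            found = depth_limited(graph, neighbor, goal, limit - 1, path)
            if found:
                return found
    return None


def iterative_deepening(graph, start, goal, max_depth=20):
    """Repeat depth-limited search with limits 0, 1, 2, ..."""
    for limit in range(max_depth + 1):
        found = depth_limited(graph, start, goal, limit)
        if found:
            return found
    return None


graph = {"A": ["B", "C"], "B": ["D"], "C": ["G"],
         "D": ["E"], "E": ["G"], "G": []}
print(depth_limited(graph, "A", "G", 10))   # plain depth-first behavior
print(iterative_deepening(graph, "A", "G"))
```

With a generous limit the depth-first search returns the deep solution found in the first branch, whereas iterative deepening returns the shallow one.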


7.1.2.3 Informed Search 
The objective of a heuristic is to produce a solution in a reasonable time frame that 
is good enough for solving the problem at hand. This solution may not be the best 
of all the actual solutions to this problem, or it may simply approximate the exact 
solution. But it is still valuable because finding it does not require much time. 
Informed search greatly reduces the amount of time by making intelligent choices 
for the nodes that are selected for expansion. This implies there exist optional ways 
of evaluating the likelihood of a given node which is on the solution path. 


7.1.2.4 Best-First Search 

Suppose that one has an evaluation function h(n) defined at each node n that estimates
the cost of reaching the goal from this node. A search that chooses the node on the
agenda for which this function is the minimum is called a greedy search. In the worst
case, its performance is no better than that of breadth-first search, and it is not
guaranteed to find an optimal solution.

The algorithm that chooses the node on the agenda for which the function
f(n) = g(n) + h(n) is minimal, where g(n) is the cost already paid to reach n, is
called A* search. It is considered important because a heuristic estimate is
uncertain, and adding g(n) keeps the search from relying on the estimate alone.
Informally, best-first search finds the best node and stops, A* search finds the
best node and then prunes the branch, and memory-bounded greedy search keeps
finding the best node until it cannot find another one.

The A* algorithm combines features of uniform cost search and pure heuristic
search to efficiently compute optimal solutions. The A* algorithm is a best-first search
algorithm in which the cost associated with a node is f(n) = g(n) + h(n), where
g(n) is the cost of the path from the initial state to node n and h(n) is the heuristic
estimate of the cost of a path from node n to a goal. Thus, f(n) estimates the lowest
total cost of any solution path going through node n. At each point, a node with the
lowest f value is chosen for expansion. Ties among nodes of equal f value should
be broken in favor of nodes with lower h values. The algorithm terminates when a
goal is chosen for expansion.

For search problems such as puzzles, the A* algorithm can find optimal solutions,
provided that h(n) is admissible, i.e., it never overestimates the true cost to a goal.
In addition, the A* algorithm makes the most efficient use of the given heuristic
function in the following sense: among all shortest path algorithms using the same
consistent heuristic function h(n), the A* algorithm expands the fewest nodes.
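The A* procedure described above can be sketched as follows. The grid, the Manhattan-distance heuristic, and all function names are assumptions chosen for illustration; the sketch orders the frontier by f = g + h, breaks ties in favor of lower h, and stops when the goal is chosen for expansion.

```python
import heapq

# Hypothetical 3x3 grid: '#' is a wall, moves are 4-connected with unit cost.
GRID = ["S..",
        "##.",
        "G.."]
START, GOAL = (0, 0), (2, 0)


def neighbors(cell):
    """Yield (successor, step_cost) pairs for the free cells next to `cell`."""
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != "#":
            yield (nr, nc), 1


def manhattan(cell):
    """Admissible heuristic h(n): grid distance to GOAL, ignoring walls."""
    return abs(cell[0] - GOAL[0]) + abs(cell[1] - GOAL[1])


def a_star(start, goal, neighbors, h):
    """Return (path, cost); nodes are expanded in order of f = g + h."""
    g = {start: 0}
    parent = {start: None}
    # Heap entries are (f, h, g, node); h in second place breaks f-ties toward lower h.
    frontier = [(h(start), h(start), 0, start)]
    while frontier:
        _, _, g_pushed, node = heapq.heappop(frontier)
        if g_pushed > g[node]:
            continue  # stale entry: a cheaper path to this node was found later
        if node == goal:  # terminate when a goal is chosen for expansion
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], g[goal]
        for succ, cost in neighbors(node):
            new_g = g[node] + cost
            if succ not in g or new_g < g[succ]:
                g[succ] = new_g
                parent[succ] = node
                heapq.heappush(frontier, (new_g + h(succ), h(succ), new_g, succ))
    return None, float("inf")


path, cost = a_star(START, GOAL, neighbors, manhattan)
print(cost, path)
```

In this grid the wall forces a detour along the right-hand column, so the only shortest path has six moves.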


7.1.2.5 Hill-Climbing Search 

Local search is a heuristic method for solving computationally hard optimization
problems. Local search is used for problems that are formulated as finding
a solution maximizing a criterion among a number of candidate solutions. Local
search algorithms move from solution to solution in the search space by applying
local changes until a solution deemed optimal is found or a time bound has
elapsed.

Hill climbing search works the way we climb a mountain: we keep moving upward
to find the best position, and when no step leads higher we rest and then try a
different way, whereas simulated annealing search also takes the surroundings into
consideration. The core of this algorithm is a mathematical optimization technique
which belongs to the family of local search. The iterative algorithm starts with an
arbitrary solution and then finds a better solution by incrementally changing a single
element of the solution. If the change produces a better solution, an incremental
change is made to the new solution, repeating until no further improvements can be found.

Mathematically, hill climbing search maximizes (or minimizes) a target function
f(x), where x is a vector of continuous or discrete values. If y = f(x) is a surface, it
may have various shapes such as one maximum, multiple local maxima, or a ridge.
Hill-climbing search algorithms include simple hill-climbing search, stochastic
hill-climbing search, and random-restart hill climbing.

The hill climbing algorithm is good for finding a local optimum, but it is not guaranteed
to find the best possible solution, namely, the global optimum out of all possible
solutions within the search space. The problem of getting stuck in local optima can be
mitigated by using restarts, i.e., repeated local search, or by more complex schemes
based on iterations like iterated local search, on memory like reactive search
optimization, or on memory-less stochastic modifications like simulated annealing.
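The difference between plain hill climbing and random-restart hill climbing can be illustrated with a small sketch. The one-dimensional landscape and the function names below are assumptions made for this example only.

```python
import random

# Hypothetical 1-D landscape: index 2 is a local maximum, index 7 the global one.
LANDSCAPE = [1, 3, 5, 4, 2, 6, 9, 12, 8, 7, 3, 2]


def hill_climb(start):
    """Steepest-ascent hill climbing over neighboring indices of LANDSCAPE."""
    current = start
    while True:
        candidates = [i for i in (current - 1, current + 1) if 0 <= i < len(LANDSCAPE)]
        best = max(candidates, key=lambda i: LANDSCAPE[i])
        if LANDSCAPE[best] <= LANDSCAPE[current]:
            return current  # no neighbor is better: a local optimum
        current = best


def random_restart_hill_climb(restarts, rng):
    """Run hill_climb from several random starts and keep the best result."""
    results = [hill_climb(rng.randrange(len(LANDSCAPE))) for _ in range(restarts)]
    return max(results, key=lambda i: LANDSCAPE[i])


print(hill_climb(0))  # 2: stuck at the local maximum
print(hill_climb(4))  # 7: reaches the global maximum
print(random_restart_hill_climb(10, random.Random(0)))
```

Starting at index 0 the climb stops at the local maximum, while a start at index 4 reaches the global one; restarting from several random positions makes the second outcome more likely but does not guarantee it.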


7.1.2.6 Genetic Search 

A genetic algorithm (GA) searches the surroundings of the current solutions at the
parental and sibling levels and finds the best way to grow. A GA is a search heuristic
that mimics the process of natural selection. This heuristic is routinely used to
generate useful solutions to optimization and search problems.

A GA belongs to the larger class of evolutionary algorithms (EA), which
generate solutions to optimization problems using techniques inspired by natural
evolution, such as inheritance, mutation, selection, and crossover [8,9]. The steps
for a GA are:


Step 1. Initialization: Create an initial population; e.g., we assume four strings ($S_i$,
$i = 1, 2, \ldots, N$, $N = 4$) consisting of '0' and '1' have been created like the DNA in
a gene: $S_1 = (01101)_2 = (13)_{10}$, $S_2 = (11000)_2 = (24)_{10}$, $S_3 = (01000)_2 = (8)_{10}$,
$S_4 = (10011)_2 = (19)_{10}$, where the subscript '2' refers to a binary number and '10'
to a decimal number.

Step 2. Evaluation: Evaluate each member of the population, and calculate a "fitness"
for the individual; e.g., we assume the fitness function is $f(S) = S^2$, therefore
$f(S_i) = S_i^2$, $i = 1, 2, \ldots, N$. Hence $S_1^2 = 169$, $S_2^2 = 576$, $S_3^2 = 64$, and $S_4^2 = 361$.

Step 3. Selection: Constantly improve the population's overall fitness. The probabilities
of the given strings are $p(S_i) = f(S_i) / \sum_{j=1}^{N} f(S_j)$; thus, approximately,
$p(S_1) = 0.14$, $p(S_2) = 0.49$, $p(S_3) = 0.06$, and $p(S_4) = 0.31$.

We therefore create four intervals to be selected: $[0, p(S_1)]$, $(p(S_1), p(S_1) + p(S_2)]$,
$(p(S_1) + p(S_2), p(S_1) + p(S_2) + p(S_3)]$, $(p(S_1) + p(S_2) + p(S_3), p(S_1) + p(S_2) + p(S_3) + p(S_4)]$;
namely, [0, 0.14], (0.14, 0.63], (0.63, 0.69], (0.69, 1.00]. We hence generate four random numbers:
$r_1 = 0.450126$, $r_2 = 0.110347$, $r_3 = 0.572496$, and $r_4 = 0.98503$. Correspondingly,
the string $S_1$ has been randomly selected once, $S_2$ twice, $S_3$ never, and $S_4$ once.
Therefore, we generate the new strings $S'_1 = (11000)_2 = (24)_{10}$, $S'_2 = (01101)_2 = (13)_{10}$,
$S'_3 = (11000)_2 = (24)_{10}$, and $S'_4 = (10011)_2 = (19)_{10}$.

Step 4. Crossover: Create new individuals by combining aspects of the selected
individuals. Using the same example, we operate crossover on the last two binary
digits of the strings $S'_1$ and $S'_2$ as well as $S'_3$ and $S'_4$. We get:

$S''_1 = (11001)_2 = (25)_{10}$, $S''_2 = (01100)_2 = (12)_{10}$, $S''_3 = (11011)_2 = (27)_{10}$,
$S''_4 = (10000)_2 = (16)_{10}$.

Step 5. Mutation: Add randomness into populations’ genetics. We assume there is 
no mutation in this example. 
Step 6. Repeat: Start again from Step 2 until a termination condition is reached. 

We get the 4G (fourth-generation) strings as:

$S_1 = (11111)_2 = (31)_{10}$, $S_2 = (11100)_2 = (28)_{10}$, $S_3 = (11000)_2 = (24)_{10}$,
$S_4 = (10000)_2 = (16)_{10}$.

Because $S_1 = (11111)_2 = (31)_{10}$ reaches the biggest number representable with five
binary digits, namely, the highest-ranking fitness has been reached (one of the
termination conditions), the iteration is terminated.

The termination conditions usually include:

• A solution is found that satisfies the minimum criteria.
• A fixed number of generations is reached.
• An allocated budget (e.g., computational time) is reached.
• The highest-ranking solution's fitness has been reached.
• A plateau is reached where successive generations no longer produce better results.
• Combinations of the above.

We can therefore summarize GA as Algorithm 4.

Result: Search output based on the GA
initialize P(0);   /* y = P(x) is the fitness function */
n ← 0;             /* n is the generation number; N is the max number */
                   /* M is the max number of individuals */
while n < N do
    for i = 0; i < M; i++ do
        Evaluate fitness of P(i);
    end
    for i = 0; i < M; i++ do
        Select operation to P(i);
    end
    for i = 0; i < M; i++ do
        Crossover operation to P(i);
    end
    for i = 0; i < M; i++ do
        Mutation operation to P(i);
    end
    for i = 0; i < M; i++ do
        P(i + 1) ← P(i);
    end
    n ← n + 1;
end

Algorithm 4: GA for search
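The worked example in Steps 1 to 4 can be reproduced with a short Python sketch. The function names, the crossover-point convention, the mutation rate, and the `run_ga` loop are assumptions made for illustration; only the four initial strings, the fitness function f(S) = S^2, and the four random numbers come from the text above.

```python
import random
from itertools import accumulate


def fitness(bits):
    """Fitness f(S) = S^2 of a binary string, as in the worked example."""
    return int(bits, 2) ** 2


def roulette_select(population, draws):
    """Pick one individual per draw r in [0, 1] using cumulative fitness shares."""
    total = sum(fitness(s) for s in population)
    if total == 0:  # degenerate case: every string decodes to zero
        return [population[i % len(population)] for i in range(len(draws))]
    cumulative = list(accumulate(fitness(s) / total for s in population))
    chosen = []
    for r in draws:
        for individual, upper in zip(population, cumulative):
            if r <= upper:
                chosen.append(individual)
                break
        else:
            chosen.append(population[-1])  # guard against floating-point round-off
    return chosen


def crossover(a, b, point):
    """Swap the tails of two strings after position `point`."""
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(("1" if b == "0" else "0") if rng.random() < rate else b for b in bits)


def run_ga(population, generations, rng, mutation_rate=0.01):
    """Evolve an even-sized population until the best possible value or the budget is reached."""
    best_possible = (2 ** len(population[0]) - 1) ** 2
    for _ in range(generations):
        if max(fitness(s) for s in population) == best_possible:
            break
        selected = roulette_select(population, [rng.random() for _ in population])
        children = []
        for a, b in zip(selected[0::2], selected[1::2]):
            point = rng.randint(1, len(a) - 1)
            children.extend(crossover(a, b, point))
        population = [mutate(s, mutation_rate, rng) for s in children]
    return population


initial = ["01101", "11000", "01000", "10011"]
selected = roulette_select(initial, [0.450126, 0.110347, 0.572496, 0.98503])
print(selected)                                # ['11000', '01101', '11000', '10011']
print(crossover(selected[0], selected[1], 3))  # ('11001', '01100')
print(run_ga(initial, 50, random.Random(1)))
```

The first two printed lines match the selection and crossover results of the worked example; the outcome of `run_ga` depends on the random seed, and the sketch does not guarantee that the maximum is reached within the budget.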


7.1.2.7 Online Search 

An online search algorithm builds a map as it explores and finds a goal if one exists.
Simulated annealing (SA) is a probabilistic technique for approximating the global
optimum of a given function in a large search space [4]. The SA algorithm randomly
selects a neighbor of the current state, examines its energy, and moves to it according
to an acceptance probability; after many iterations, it terminates with an approximation
of the global optimal solution. It is often used when the
search space is discrete. Simulated annealing may be more efficient than exhaustive
enumeration provided that the goal is merely to find an acceptable solution in a fixed
amount of time rather than the best possible solution.
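A minimal simulated annealing sketch is given below. The energy landscape, the geometric cooling schedule, and the parameter values are assumptions chosen for illustration; the acceptance rule shown, exp(-delta / T), is the usual Metropolis criterion rather than something specified in the text.

```python
import math
import random

# Hypothetical energy landscape over integer states 0..7; state 6 has the lowest energy.
ENERGY = [5, 3, 4, 6, 2, 7, 1, 8]


def random_neighbor(state, rng):
    """Move one step left or right, staying inside the landscape."""
    step = rng.choice((-1, 1))
    return min(max(state + step, 0), len(ENERGY) - 1)


def simulated_annealing(start, rng, t0=10.0, cooling=0.95, steps=500):
    """Return the lowest-energy state seen while annealing from `start`."""
    state = best = start
    temperature = t0
    for _ in range(steps):
        candidate = random_neighbor(state, rng)
        delta = ENERGY[candidate] - ENERGY[state]
        # Always accept improvements; accept a worse state with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            state = candidate
            if ENERGY[state] < ENERGY[best]:
                best = state
        temperature = max(temperature * cooling, 1e-9)  # geometric cooling schedule
    return best


result = simulated_annealing(start=1, rng=random.Random(0))
print(result, ENERGY[result])
```

Because worse moves are sometimes accepted while the temperature is high, the search can leave the local minimum at state 1, which plain descent from that start could not do.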

Although the hill climbing algorithm is surprisingly effective at finding a good
solution, it has a tendency to get stuck in local optima; the simulated annealing
algorithm is good at avoiding this problem and is much better on average at
finding an approximate global optimum. When we measure the information retrieval
results, we need to answer the following questions:

• Completeness: is the search guaranteed to find a solution if one exists?

• Optimality: is the solution found the best one?

• Complexity: what is the time complexity, and how much memory is needed?


In order to get the search evaluations, the training set, test set, and ground truth 
are needed. In information retrieval, we always need ground truth which is the 
dataset [12] manually labeled based on the facts. 


7.2 Event Mining and Reasoning 


The main goal of event mining is to provide an automatic way of handling surveillance
events so as to respond to real-time observations of people and vehicles in surveillance
[6]. In surveillance, searching for an important clue in order to solve a problem is
just like finding a needle in a haystack [14].


7.2.1 Event Mining 
The typical categories of event mining include:


• Mining from event streams
• Mining from event sequences
• Event graph mining


7.2.2 Event Reasoning 


7.2.2.1 Forward Chaining 

Forward chaining is one of the two primary methods of reasoning when using an 
inference engine. Forward chaining is a popular implementation strategy for expert 
systems in computational intelligence. 

Forward chaining starts with the available data and uses inference rules to extract
more data until a goal is reached. An inference engine using forward chaining
searches the inference rules until it finds one where the antecedent is known to be 
true. When such a rule is found, the engine can infer the new information from the 
given data. Inference engines will iterate through this process until a goal is reached. 
The name “forward chaining” comes from the fact that the computer starts with the 
data and reasons its way to the answer, as opposed to backward chaining. 
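Forward chaining can be sketched as a loop that keeps firing rules until nothing new can be inferred. The surveillance rules and fact names below are hypothetical and serve only to illustrate the mechanism.

```python
# Hypothetical surveillance rules: (antecedents, consequent).
RULES = [
    (("person_detected", "restricted_area"), "intrusion"),
    (("intrusion", "night_time"), "raise_alarm"),
    (("vehicle_detected", "no_parking_zone"), "parking_violation"),
]


def forward_chain(rules, facts, goal=None):
    """Fire rules whose antecedents are all known until nothing new can be inferred."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if consequent not in known and set(antecedents) <= known:
                known.add(consequent)  # assert the newly inferred fact
                changed = True
                if consequent == goal:
                    return known
    return known


facts = {"person_detected", "restricted_area", "night_time"}
derived = forward_chain(RULES, facts, goal="raise_alarm")
print(sorted(derived - facts))  # ['intrusion', 'raise_alarm']
```

The engine starts from the data, first infers `intrusion`, and only then can the second rule fire to reach the goal.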




7.2.2.2 Backward Chaining 

Backward chaining (or backward reasoning) is an inference method that is described 
as working backward from the goal(s). It is used in automated theorem provers, 
inference engines, proof assistants and other applications. 

Backward chaining is one of the two most frequently used methods of reasoning
with inference rules and logical implications; it usually employs a depth-first
search strategy.

Backward chaining starts with a list of goals and works backward from the consequent
to the antecedent to see if there is data available that will support any of
these consequents. An inference engine using backward chaining searches the
rules until it finds one which has a consequent that matches a desired goal. If the
antecedent of that rule is not known to be true, then it is added to the list of goals;
for the original goal to be confirmed, data must also be provided that confirms this
new goal.
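Backward chaining can be sketched as a recursive, depth-first proof of a goal. The rule set and fact names are hypothetical, and the `pending` argument is an assumption added here to stop circular rule chains.

```python
# Hypothetical surveillance rules: (antecedents, consequent).
RULES = [
    (("person_detected", "restricted_area"), "intrusion"),
    (("intrusion", "night_time"), "raise_alarm"),
]


def backward_chain(rules, facts, goal, pending=frozenset()):
    """Return True if `goal` is a known fact or can be derived from the rules."""
    if goal in facts:
        return True
    if goal in pending:
        return False  # the goal is already being pursued: avoid circular reasoning
    for antecedents, consequent in rules:
        # A rule whose consequent matches the goal turns its antecedents into sub-goals.
        if consequent == goal and all(
            backward_chain(rules, facts, sub_goal, pending | {goal})
            for sub_goal in antecedents
        ):
            return True
    return False


facts = {"person_detected", "restricted_area", "night_time"}
print(backward_chain(RULES, facts, "raise_alarm"))                     # True
print(backward_chain(RULES, facts - {"night_time"}, "raise_alarm"))    # False
```

Here the engine starts from `raise_alarm`, adds `intrusion` and `night_time` as sub-goals, and succeeds only if each of them is supported by the available data.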


7.2.2.3 Probabilistic Reasoning 

A dynamic Bayesian network (DBN) is a Bayesian network which relates variables to
each other over adjacent time steps [4]. DBNs have shown potential for a wide range of
data mining tasks. For example, in speech recognition, digital forensics, protein
sequencing, and bioinformatics, DBNs have been shown to produce solutions equivalent
to those of hidden Markov models (HMMs) and Kalman filters [3].

Dempster-Shafer (DS) theory refers to the original conception of the theory by
Dempster and Shafer. DS theory allows one to combine evidence from multiple
sources and arrive at a degree of belief represented by a function. In particular,
different rules for combining evidence have been proposed, often with a view to
handling conflicts in evidence better.
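As one concrete instance of combining evidence, the sketch below applies Dempster's rule of combination to two mass functions. The choice of this particular rule, the two sources, and their mass values are assumptions made for illustration; the text does not prescribe a specific combination rule.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions (frozenset of hypotheses -> mass) by Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("the two sources are in total conflict")
    # Renormalize by the non-conflicting mass.
    return {h: mass / (1.0 - conflict) for h, mass in combined.items()}


INTRUDER, ANIMAL = frozenset({"intruder"}), frozenset({"animal"})
EITHER = INTRUDER | ANIMAL

# Hypothetical evidence from two sensors.
camera = {INTRUDER: 0.6, EITHER: 0.4}
motion_sensor = {INTRUDER: 0.5, ANIMAL: 0.3, EITHER: 0.2}

fused = combine(camera, motion_sensor)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 3))
```

In this example the conflicting mass is 0.6 × 0.3 = 0.18, so the remaining masses are divided by 0.82 and the belief in an intruder rises to about 0.756.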

Kalman filtering [13], also known as linear quadratic estimation (LQE), is an algorithm
that uses a series of measurements observed over time, containing noise (random
variations) and other inaccuracies, and produces estimates of unknown variables that
tend to be more precise than those based on a single measurement alone. More formally,
the Kalman filter operates recursively on streams of noisy input data to produce
a statistically optimal estimate of the underlying system state.

Kalman filtering works in a two-step process. In the prediction step, the filter
produces estimates of the current state variables along with their uncertainties. Once
the outcome of the next measurement is observed, these estimates are updated using
a weighted average, with more weight being given to estimates with higher certainty.
Because of the recursive nature of the algorithm, it can run in real time using only the
present input measurements, the previously calculated state, and its uncertainty
matrix.
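The predict and update steps can be sketched for the simplest case, a scalar state that is assumed to stay constant between measurements. The function name, the constant-state model, and the noise parameters `q` and `r` are assumptions made for illustration.

```python
def kalman_1d(measurements, x0, p0, q, r):
    """Scalar Kalman filter for a constant state.

    x0, p0: initial state estimate and its variance
    q: process noise variance, r: measurement noise variance
    Returns a list of (estimate, variance) pairs, one per measurement.
    """
    x, p = x0, p0
    history = []
    for z in measurements:
        # Prediction step: the state is assumed unchanged, its uncertainty grows by q.
        p = p + q
        # Update step: weighted average of prediction and measurement.
        k = p / (p + r)          # Kalman gain: more weight to the more certain source
        x = x + k * (z - x)
        p = (1.0 - k) * p
        history.append((x, p))
    return history


estimates = kalman_1d([1.0, 1.0, 1.0], x0=0.0, p0=1.0, q=0.0, r=1.0)
for x, p in estimates:
    print(round(x, 4), round(p, 4))
```

With these inputs the estimate moves from 0.5 to 0.6667 to 0.75 while the variance shrinks from 0.5 to 0.25, and each step uses only the new measurement, the previous estimate, and its variance.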
7.3 Event Exploration and Discovery

7.3.1 Event Discovery


Event discovery derives previously unknown knowledge from the existing knowledge;
the input includes event databases (eBase), event entities in text, event graphs in
contextual information, and the event web [10,16] as the event carrier, while the
output includes metadata, tags, rules, ontological results of events, etc. [7, 15].


7.3.2 Event Exploration 


Event exploration [16] is to enrich the intellectual capital in the knowledge of events.
The main tasks of event exploration for a computer include:


• To know what it already knows
• To grow what it knows
• To share what it knows
• To save what it knows


A computer gains knowledge from the training dataset; as a machine, a computer
explores further unknown information from what it already knows, grows and
enriches the knowledge, circulates the knowledge to the community, and saves the
knowledge for further exploration. Like the intelligence of human beings, computers
grow their knowledge in cognitive ways by using artificial intelligence.

In summary, the life cycle of events includes event detection and recognition, then
archiving them in an event database for search, retrieval, mining, and reasoning, and
finally discovery and exploration. The key point of the event life cycle is to generate new
events based on existing ones by using various event operations and to repeat the same
life routine of an event. The newly generated events could join the life cycle and be
applied to generate other events; thereafter, the event database will be updated [16].
The flow is shown in Fig. 7.1.


7.4 Questions 


Question 1. What is the concept event? What are the relationships between events? 
Question 2. What are the operations between events? 

Question 3. What is the life cycle of an event? 

Question 4. What are the limitations of the genetic algorithm (GA)?

Fig. 7.1 The life cycle of

Surveillance Alarm Making


8.1 Alarm Setting 


In a surveillance system, alarming conditions are usually set at the time
of sensor deployment. Once the conditions are satisfied, the alarming system will
be activated. Multiple communication channels will be employed to deliver the
messages simultaneously. Alarming usually falls into three types, i.e., rule-based
alarming, probability-based alarming, and system-based alarming.


8.1.1 Rule-Based Alarming 


If the triggering conditions, such as a sensitive area and a restricted time, are satisfied,
then the alarming function will be activated. Logically, we describe the alarming
procedure as,


IF <condition> THEN <action> END


The rule-based alarming systems manage the alarming process by creating rules
or triggering conditions which automatically work for alarm processing, including
delivering alarms to multiple communication channels and storing the alarms. The
systems provide a centralized location for organizing and viewing the alarms through
networks or the Internet.

The rule-based alarming systems are important in monitoring for the early detection
of hazards. The regions where the alarming systems are set and the thresholds applied
depend on the specific needs of the surveillance system. In addition, we set
specified conditions and actions to be performed as soon as an alarm
is activated.

The rule-based alarming systems allow us to configure a rule onsite for alarm
processing, ranking, and storing [10]. The systems provide tools and back-end services
to effectively manage the alarm information.

A rule consists of two parts, namely a set of conditions and a set of actions. If all
the conditions specified in a rule are satisfied, then the set of actions in the rule will
be activated. The rules are applied in their listed order; the topmost rule has
the highest precedence in the ranking list.

When more than one rule is specified, the incoming alarm is matched against the
rules beginning with the topmost enabled rule in the list. If the incoming alarm
matches a rule, then only the actions of that rule are carried out, and the
remaining rules are not processed.
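The first-match semantics described above can be sketched as follows. The `Rule` class, the alarm fields (`zone`, `hour`), and the action names are hypothetical and only illustrate how rule order determines precedence.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    condition: Callable[[Dict], bool]  # IF <condition>
    actions: List[str]                 # THEN <action>
    enabled: bool = True


def process_alarm(rules, alarm):
    """Return the actions of the topmost enabled rule that matches; later rules are skipped."""
    for rule in rules:
        if rule.enabled and rule.condition(alarm):
            return rule.actions
    return []


# Rules are listed in order of precedence, topmost first.
RULES = [
    Rule("night intrusion",
         lambda a: a["zone"] == "restricted" and a["hour"] >= 22,
         ["siren", "sms"]),
    Rule("intrusion", lambda a: a["zone"] == "restricted", ["email"]),
]

print(process_alarm(RULES, {"zone": "restricted", "hour": 23}))  # ['siren', 'sms']
print(process_alarm(RULES, {"zone": "restricted", "hour": 10}))  # ['email']
print(process_alarm(RULES, {"zone": "lobby", "hour": 23}))       # []
```

The first alarm also satisfies the second rule, but because the topmost matching rule wins, only the siren and SMS actions are returned.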


8.1.2 Probability-Based Alarming 


It is desirable to check alarms from the viewpoint of risks, error avoidance, etc. When
alarming conditions are set, and if we take risk avoidance into consideration [16],
we hope to make a decision with fewer errors [9]; hence, the best choice is the one
with the lower error probability [8]. Mathematically,


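$$P(\text{error} \mid x) = \min\big(P(\omega_1 \mid x),\, P(\omega_2 \mid x)\big) \tag{8.1}$$

$$P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x), & P(\omega_1 \mid x) < P(\omega_2 \mid x) \\ P(\omega_2 \mid x), & P(\omega_2 \mid x) \le P(\omega_1 \mid x) \end{cases} \tag{8.2}$$

where $P(\omega_i \mid x)$ is the conditional (posterior) probability of the choice $\omega_i$ given the
alarm $x$, which is the error probability when the other choice is made. We use
Bayes' theorem for the conditional probability:

$$P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x), & P(x \mid \omega_1) \cdot P(\omega_1) < P(x \mid \omega_2) \cdot P(\omega_2) \\ P(\omega_2 \mid x), & P(x \mid \omega_2) \cdot P(\omega_2) \le P(x \mid \omega_1) \cdot P(\omega_1) \end{cases} \tag{8.3}$$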


If we treat the error as the difference between two conditional probabilities, we 
select positive difference for the right alarms: 


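$$g(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x) \tag{8.4}$$

$$P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x), & g(x) < 0 \\ P(\omega_2 \mid x), & g(x) \ge 0 \end{cases} \tag{8.5}$$

Because the logarithm is a monotonically increasing function, comparing the logarithms
leads to the same decision (here $g'(x)$ denotes the logarithmic form, not a derivative):

$$g'(x) = \ln P(\omega_1 \mid x) - \ln P(\omega_2 \mid x) \tag{8.6}$$

If we use Bayes' theorem for the conditional probability, then

$$g'(x) = \ln[P(x \mid \omega_1) \cdot P(\omega_1)] - \ln[P(x \mid \omega_2) \cdot P(\omega_2)] \tag{8.7}$$

$$g'(x) = \ln \frac{P(x \mid \omega_1)}{P(x \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)} \tag{8.8}$$

$$P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x), & g'(x) < 0 \\ P(\omega_2 \mid x), & g'(x) \ge 0 \end{cases} \tag{8.9}$$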
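If we take risks into consideration, we select the case with less risk:

$$R(x) = \begin{cases} R(\omega_1 \mid x), & R(\omega_1 \mid x) < R(\omega_2 \mid x) \\ R(\omega_2 \mid x), & R(\omega_1 \mid x) > R(\omega_2 \mid x) \end{cases} \tag{8.10}$$

where $R(\cdot)$ is the risk function and $R(x)$ denotes the selected risk.
If the risk is allowed to be expressed by using a linear combination of probabilities,

$$\begin{aligned} R(\omega_1 \mid x) &= \lambda_{11} \cdot P(\omega_1 \mid x) + \lambda_{12} \cdot P(\omega_2 \mid x) \\ R(\omega_2 \mid x) &= \lambda_{21} \cdot P(\omega_1 \mid x) + \lambda_{22} \cdot P(\omega_2 \mid x) \end{aligned} \tag{8.11}$$

Then,

$$R(x) = \begin{cases} R(\omega_1 \mid x), & (\lambda_{11} - \lambda_{21}) \cdot P(\omega_1 \mid x) < (\lambda_{22} - \lambda_{12}) \cdot P(\omega_2 \mid x) \\ R(\omega_2 \mid x), & (\lambda_{11} - \lambda_{21}) \cdot P(\omega_1 \mid x) > (\lambda_{22} - \lambda_{12}) \cdot P(\omega_2 \mid x) \end{cases} \tag{8.12}$$

Again, we use Bayes' theorem [14]:

$$R(x) = \begin{cases} R(\omega_1 \mid x), & (\lambda_{11} - \lambda_{21}) \cdot P(x \mid \omega_1) \cdot P(\omega_1) < (\lambda_{22} - \lambda_{12}) \cdot P(x \mid \omega_2) \cdot P(\omega_2) \\ R(\omega_2 \mid x), & (\lambda_{11} - \lambda_{21}) \cdot P(x \mid \omega_1) \cdot P(\omega_1) > (\lambda_{22} - \lambda_{12}) \cdot P(x \mid \omega_2) \cdot P(\omega_2) \end{cases} \tag{8.13}$$

$$R(x) = \begin{cases} R(\omega_1 \mid x), & \dfrac{P(x \mid \omega_1)}{P(x \mid \omega_2)} < \dfrac{\lambda_{22} - \lambda_{12}}{\lambda_{11} - \lambda_{21}} \cdot \dfrac{P(\omega_2)}{P(\omega_1)} \\[2ex] R(\omega_2 \mid x), & \dfrac{P(x \mid \omega_1)}{P(x \mid \omega_2)} > \dfrac{\lambda_{22} - \lambda_{12}}{\lambda_{11} - \lambda_{21}} \cdot \dfrac{P(\omega_2)}{P(\omega_1)} \end{cases} \tag{8.14}$$

Eq. (8.14) follows from Eq. (8.13) by dividing both sides by $(\lambda_{11} - \lambda_{21}) \cdot P(x \mid \omega_2) \cdot P(\omega_1)$,
which is assumed to be positive here; if $\lambda_{11} < \lambda_{21}$, the two inequalities are reversed.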


Hence, alarm making based on these decision rules reduces errors and avoids
risks in a computational way.
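The minimum-error and minimum-risk decisions can be compared numerically with a small sketch. The likelihoods, priors, and loss values below are hypothetical; `loss[i][j]` plays the role of the coefficient λ in Eq. (8.11), i.e., the cost of choosing class i when the true class is j.

```python
def posteriors(likelihoods, priors):
    """Bayes' theorem: P(w_i | x) from P(x | w_i) and P(w_i)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)
    return [j / evidence for j in joint]


def conditional_risks(post, loss):
    """R(w_i | x) = sum_j loss[i][j] * P(w_j | x), a linear combination of posteriors."""
    return [sum(loss[i][j] * post[j] for j in range(len(post))) for i in range(len(loss))]


def decide(likelihoods, priors, loss):
    """Return (index of the class with the smallest risk, posteriors, risks)."""
    post = posteriors(likelihoods, priors)
    risks = conditional_risks(post, loss)
    choice = min(range(len(risks)), key=risks.__getitem__)
    return choice, post, risks


# Hypothetical two-class problem: class 0 = intrusion, class 1 = no intrusion.
LIKELIHOODS = [0.6, 0.2]   # P(x | w_1), P(x | w_2)
PRIORS = [0.1, 0.9]        # P(w_1), P(w_2)

zero_one_loss = [[0, 1], [1, 0]]      # every error costs the same
asymmetric_loss = [[0, 1], [10, 0]]   # missing an intrusion costs ten times more

print(decide(LIKELIHOODS, PRIORS, zero_one_loss))
print(decide(LIKELIHOODS, PRIORS, asymmetric_loss))
```

The posteriors are about 0.25 and 0.75. Under the zero-one loss the smaller risk belongs to "no intrusion" (index 1), whereas the asymmetric loss raises the risk of that choice to about 2.5 and the decision switches to raising the alarm (index 0).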

For alarm making from the aspect of maximum a posteriori (MAP)-Markov random
field (MRF), the risk in Bayes estimation is defined as,

$$R(f^*) = \int_{f \in \mathbb{F}} C(f^*, f)\, P(f \mid d)\, df \tag{8.15}$$

where $d$ is the observation, $P(f \mid d)$ is the posterior distribution, and
$C(f^*, f) = \|f^* - f\|^2$ is a $\delta(f^*, f)$ cost function,


we a (8.16) 
Therefore, if

$$\frac{\partial R(f^*)}{\partial f^*} = 0,$$

then we have

$$f^* = \arg\max_{f \in \mathbb{F}} P(f \mid d).$$


8.1.3 System-Based Alarming 


A system-based alarm is also called an event-driven alarm; it adopts expert systems
as the source of domain knowledge [3]. Therefore, whether an alarm is triggered or not
depends entirely on the events associated with the current one. If one or more
events serving as the triggering conditions are satisfied, the alarm will be activated [6,11].

System-based alarming is derived from event detection and recognition and
has been applied to intelligent surveillance. MATLAB has the event-driven blockset
as a toolbox to simulate the event-triggering processes. The blockset makes alarm
setting, programming, and activation much easier; it also provides an interface
for sensor management and signal input. The automated control is associated with
the MATLAB image and vision toolbox, and any special events detected by MATLAB
will be utilized for alarm making.


8.2 Decision Making 
8.2.1 Simple Decision Making 
In computational intelligence, a decision has the following properties: 


• Orderability:
If decisions are successive, namely $a > b$, and meanwhile c is after b, $b > c$, then
$a > c$, which is interpreted as c being behind a.

• Transitivity:
If it is true that decision a is transited to decision b, i.e., $a \to b$, and $b \to c$, then
a is able to be transited to decision c, i.e., $a \to c$.

• Continuity:
If a decision is continuous at $x_0$, the decision is continuous from both the left
and right sides, i.e., if $\lim_{x \to x_0} f(x) = f(x_0)$, then
$\lim_{x \to x_0^-} f(x) = f(x_0^-) = \lim_{x \to x_0^+} f(x) = f(x_0^+) = f(x_0)$.

• Substitutability:
If decision a substitutes decision b, namely $a \Leftrightarrow b$, and $b \Leftrightarrow c$, then a is able to
be applied to replace c, $a \Leftrightarrow c$.

• Monotonicity:
If decision a is more critical than decision b, $a > b$, then it will have a greater effect
than b, $f(a) > f(b)$, where $f(\cdot)$ is a function defined on the decision domain.

• Decomposability:
If decision x is within a predefined domain, it must be decomposable or factorable
into subdecisions: $\forall x \in \Omega$, $f(x) = h(\alpha) \cdot g(\beta)$, where $\alpha$ and $\beta$ are subdecisions, and
$f(\cdot)$, $h(\cdot)$, and $g(\cdot)$ are functions defined on the decision domain.


A decision is made based on a relationship network. A decision network (also called
an influence diagram) is a compact graphical and mathematical representation of a
decision situation. An influence diagram includes uncertainty nodes (ovals), decision
nodes (rectangles), and value nodes (octagons) [7].

An influence diagram sets evidence variables for the current state and the possible
values of the decision node. That means setting the decision node to each value, computing
the posterior probabilities for the nodes, and outputting the result for the action.

Decision networks extend belief networks by including decision variables and
utilities. In particular, a decision network is a directed acyclic graph.


8.2.1.1 Expert System 

In artificial intelligence, an expert system is a computer system that emulates the
decision-making ability of a human expert. Expert systems are designed to solve
complex problems by reasoning about knowledge, represented primarily as
IF <condition> THEN <action> rules rather than through conventional procedural code.

An expert system is divided into two subsystems: inference engine and knowledge 
base which represents facts and rules. The inference engine applies the rules to the 
known facts so as to deduce new facts. 

In early expert systems, these facts were represented primarily as flat assertions
about variables. Later, the knowledge base took on more structure and utilized concepts
from object-oriented programming (OOP). Assertions were replaced by
values of object instances.

The inference engine is an automated reasoning system that evaluates the current
state of the knowledge base, applies relevant rules, and then asserts new knowledge
into the knowledge base [12]. The inference engine may also include capabilities
for explanation so that it can explain to a user the chain of reasoning used to arrive
at a particular conclusion by tracing back over the fired rules that resulted in the
assertion.

There are primarily two modes for an inference engine: forward chaining and
backward chaining. The different approaches are dictated by whether the inference
engine is being driven by the antecedent or the consequent of a rule. In forward chaining,
an antecedent fires and asserts the consequent. In backward chaining, the engine starts
from a consequent that matches a goal and looks for data supporting its antecedent.

For an expert system, the team consists of experts and engineers. The fundamental
procedure of building an expert system is as follows:

Step 1. Acquire domain knowledge. The knowledge comes from two sources: experts
with domain knowledge and software engineers. Software engineers digitalize
human experience from domain experts to understand the requirements
of software design and implementation.

Step 2. Create a causal model. The team consisting of experts and software engineers
initializes the project, sets the initial parameters of the system, and starts
the design and development.

Step 3. Simplify a qualitative decision model. Based on domain knowledge and the
causal model, a decision model is created to simulate the process of decision
making from observations.

Step 4. Assign probabilities. The relevant probabilities are given from the observations.

Step 5. Verify and refine the model. The model will be verified and refined according
to the real requirements; the errors and risks will be reduced to the minimal
level.

Step 6. Perform sensitivity analysis. Test the designed model and make sure that a small
change does not affect the system dramatically.


8.2.1.2 Decision Tree 
A decision tree [1] can represent any Boolean function; it splits the records of decision
making based on the attributes that optimize the current criterion. Building a decision
tree requires answering the questions: How to split the records? How to specify the
attribute test condition? How to determine the best split? When to stop splitting?
In a decision tree, decision nodes are represented by squares, chance nodes
by circles, and end nodes by triangles, as shown in Fig. 8.1.
In a decision tree, we split the records for decision making based on the attribute
test. A multiway split uses as many partitions as there are distinct values. A binary split
divides the values into two subsets. We stop expanding a node when all the records
belong to the same class or when all the records have similar attribute values [1].

A decision tree is a flowchart-like structure in which each internal node represents a
"test" on an attribute, each branch represents an outcome of the test, and each leaf node
represents a class label; a decision will be taken after computing all attributes. The
paths from the root to the leaves represent classification routines.

In decision making, a decision tree and the closely related influence diagram
are used as a visual and analytical decision support tool where the expected values
of competing alternatives are calculated. A decision tree consists of three types of
nodes: decision nodes, chance nodes, and end nodes [1].


8.2.1.3 Deep Neural Decision Forests and Decision Networks 
Mathematically, a tree is modeled as [15], 


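$$P_T[y \mid x, \theta, \pi] = \sum_{\ell \in \mathcal{L}} \pi_{\ell y}\, \mu_\ell(x \mid \theta) \tag{8.17}$$

$$\mu_\ell(x \mid \theta) = \prod_{n \in \mathcal{N}} d_n(x; \theta)^{\mathbf{1}[\ell \swarrow n]}\; \bar{d}_n(x; \theta)^{\mathbf{1}[n \searrow \ell]} \tag{8.18}$$

$$\bar{d}_n(x; \theta) = 1 - d_n(x; \theta) \tag{8.19}$$

where $d_n(x; \theta)$ is a sigmoid function, $\ell \swarrow n$ means that leaf $\ell$ is in the left subtree, and
$n \searrow \ell$ means that it is in the right subtree of node $n$. $\mathcal{N}$ and $\mathcal{L}$ are the internal nodes and
the terminal nodes of the tree, respectively. $\pi_{\ell y}$ is the probability of a sample reaching
leaf $\ell$ to take on class $y$. $\mu_\ell(x \mid \theta)$ is the probability that sample $x$ will reach leaf $\ell$,
$\sum_{\ell \in \mathcal{L}} \mu_\ell(x \mid \theta) = 1$, $\forall x \in \mathcal{X}$, and $\theta$ is a parameter.

A deep neural decision forest (dNDF) is an ensemble of decision trees
$\mathcal{F} = \{T_1, T_2, \ldots, T_k\}$ that outputs the average
$P_{\mathcal{F}}[y \mid x] = \frac{1}{k} \sum_{h=1}^{k} P_{T_h}[y \mid x]$. The log loss of a tree is,

$$R(\theta, \pi; \mathcal{T}) = -\frac{1}{|\mathcal{T}|} \sum_{(x, y) \in \mathcal{T}} \log\big(P_T[y \mid x, \theta, \pi]\big) \tag{8.20}$$

where $\mathcal{T} \subset \mathcal{X} \times \mathcal{Y} = \{(x, y)\}$ is the training set. We minimize $R(\theta, \pi; \mathcal{T})$ as:

$$\theta^{(t+1)} = \theta^{(t)} - \eta\, \frac{\partial R(\theta^{(t)}, \pi; \mathcal{B})}{\partial \theta} \tag{8.21}$$

where $\eta > 0$ is a learning rate and $\mathcal{B} \subseteq \mathcal{T}$ is a mini-batch. $\pi_{\ell y}$ could be updated as

$$\pi_{\ell y}^{(t+1)} = \frac{1}{Z_\ell^{(t)}} \sum_{(x, y') \in \mathcal{T}} \frac{\mathbf{1}[y = y']\; \pi_{\ell y}^{(t)}\; \mu_\ell(x \mid \theta)}{P_T[y \mid x, \theta, \pi^{(t)}]} \tag{8.22}$$

where $Z_\ell^{(t)}$ is a normalizing factor.

Fig. 8.2 Random forest for decision making


Therefore, $\min R(\theta, \pi; \mathcal{T})$ could be implemented by alternating mini-batch
updates of $\theta$ with updates of $\pi$. A deep neural decision forest (dNDF) can be
implemented by using typically available fully connected (or inner product) and
sigmoid layers in DNN frameworks.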

In decision making, a random forest could help us to make correct decisions based
on the voting theory of pattern classification. When we have multiple decision trees
available, we construct a random forest for decision making. In the forest,
we have multiple trees to reflect the decisions, and we therefore need to fuse the
decisions together, as shown in Fig. 8.2. When multiple decisions are combined by
voting, boosting, etc., the correctness of decision making is expected to improve.

Random forest [5] is an ensemble learning method for classification that operates
by constructing a multitude of decision trees at training time and outputting the class
that is the mode of the classes output by individual
trees. The method combines the "bagging" idea and the random selection of features
in order to construct a collection of decision trees with controlled variance.

After each tree is built, all of the data is run down the tree, and proximities are 
computed for each pair of cases. If two cases occupy the same terminal node, their 
proximity is increased by one. At the end of the run, the proximities are normalized 
by the number of trees. Proximities are used in replacing missing data, locating 
outliers, and producing illuminating low-dimensional views of the data. 
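The two mechanisms just described, fusing tree decisions by voting and computing proximities from shared terminal nodes, can be sketched without any machine learning library. The tree outputs and leaf identifiers below are made-up values used only to show the bookkeeping.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Fuse the class labels predicted by the individual trees (mode of the votes)."""
    return Counter(tree_predictions).most_common(1)[0][0]


def proximity_matrix(leaf_ids):
    """leaf_ids[t][i] is the terminal node reached by case i in tree t.

    Two cases gain one unit of proximity for every tree in which they share a
    terminal node; the counts are then normalized by the number of trees.
    """
    n_trees, n_cases = len(leaf_ids), len(leaf_ids[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]
    for leaves in leaf_ids:
        for i in range(n_cases):
            for j in range(n_cases):
                if leaves[i] == leaves[j]:
                    prox[i][j] += 1.0
    return [[value / n_trees for value in row] for row in prox]


# Hypothetical outputs of three trees for one surveillance observation.
print(majority_vote(["alarm", "no_alarm", "alarm"]))  # alarm

# Hypothetical terminal nodes for three cases in two trees.
LEAF_IDS = [[0, 0, 1],
            [2, 3, 3]]
for row in proximity_matrix(LEAF_IDS):
    print(row)
```

Cases 0 and 1 share a leaf in the first tree only, so their proximity is 0.5; cases 0 and 2 never share a leaf, so their proximity is 0.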

A decision diagram or a decision network is a compact graphical and mathematical 
representation of a decision situation. It is a generalization of a Bayesian network, in 
which not only probabilistic inference problems but also decision-making problems 
are modeled. The steps of a decision network are: 


Step 1. Set evidence variables for the current state.

Step 2. Set the possible values of the decision node.

Step 3. Set the decision node to that value.

Step 4. Calculate the posterior probabilities for the nodes.

Step 5. Output the result for the action.

In a decision network, each decision node is assigned a variable and the possible
values for the current state. After this initialization, the posteriors for the nodes will be
calculated based on the priors and evidence. With the posteriors, the effective decision
of each node will be made for practical applications.


8.2.2 Complex Decision Making 


Markov decision processes (MDPs) provide a mathematical framework for modeling 
decision making in situations where outcomes are partly random and partly under 
the control of a decision maker [16]. MDPs are useful for studying a wide range of 
optimization problems through dynamic programming and reinforcement learning 
which are used in a wide area of disciplines including robotics, automated control, 
economics, and manufacturing. A Markov decision process (MDP) model contains: 


• A set of possible world states $S$

• A set of possible actions $A$

• A real-valued reward function $R(s)$

• A transition model $T(s, a, s^*)$.


A Bellman equation, Eq. (8.23), is a necessary condition for optimality associated
with dynamic programming; it writes the value of a decision problem at
a certain point in terms of the payoff from some initial choices and the value of
the remaining decision problem resulting from those initial choices. This breaks a
dynamic optimization problem into simpler subproblems:

$$U(s) = R(s) + \gamma \cdot \max_{a} \sum_{s^*} T(s, a, s^*) \cdot U(s^*) \tag{8.23}$$

where $\gamma$ is the discount factor. Eq. (8.24) shows the Bellman updates,

$$U_{i+1}(s) \leftarrow R(s) + \gamma \cdot \max_{a} \sum_{s^*} T(s, a, s^*) \cdot U_i(s^*), \quad i = 1, 2, \ldots \tag{8.24}$$

The convergence of these updates is guaranteed by Eq. (8.25),

$$\|U_{i+1} - U_i\| < \varepsilon, \quad \varepsilon > 0,\ 1 > \gamma > 0 \tag{8.25}$$

This ensures that $U_i \approx U$, namely

$$\|U_i - U\| < \varepsilon, \quad \varepsilon > 0 \tag{8.26}$$
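The Bellman updates can be sketched as value iteration on a tiny MDP. The three states, the two actions, the transition probabilities, and the rewards are hypothetical; the loop stops when the largest change between successive utility estimates falls below a threshold.

```python
STATES = [0, 1, 2]
ACTIONS = ["stay", "advance"]


def transition(state, action):
    """Hypothetical transition model T(s, a, s*): returns {next_state: probability}."""
    if state == 2 or action == "stay":
        return {state: 1.0}              # state 2 is absorbing
    return {state + 1: 0.8, state: 0.2}  # "advance" succeeds with probability 0.8


def reward(state):
    """Hypothetical reward function R(s): only state 2 is rewarded."""
    return 1.0 if state == 2 else 0.0


def value_iteration(states, actions, gamma=0.9, eps=1e-6):
    """Apply the Bellman update to every state until the utilities stop changing."""
    utility = {s: 0.0 for s in states}
    while True:
        updated = {
            s: reward(s) + gamma * max(
                sum(p * utility[s2] for s2, p in transition(s, a).items())
                for a in actions
            )
            for s in states
        }
        delta = max(abs(updated[s] - utility[s]) for s in states)
        utility = updated
        if delta < eps:
            return utility


U = value_iteration(STATES, ACTIONS)
print({s: round(u, 3) for s, u in U.items()})
```

For the absorbing state the fixed point is U(2) = 1 / (1 - 0.9) = 10, and the utilities of states 1 and 0 are smaller because the reward is reached later and with uncertainty.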


8.2.3 Alarming Channels


Once a surveillance alarm is given, it should be dispatched through multiple channels [2].
Typically, the following channels are used to deliver the message:


• Email. An email serves as an official record that can be traced back; emails can be
checked offline but may not be read promptly. An email system could be linked to a
broadcast system such as Twitter, Facebook, or a Web site.

• Mobile SMS/MMS. A mobile phone is a “Swiss Army knife” with multiple
communication functions; it can be used anywhere at any time, as long as the
phone can ring.

• Call. A phone call is the most direct and convenient way to communicate quickly.
Most countries allow certain special numbers to be dialed free of charge for public
services.

• Speakers. Loudspeakers make sure that everybody in a public area is informed
promptly, directly or indirectly. They also help people remind one another during
an alarm.

• Siren. Sirens and flashing lights are for people who cannot access the normal
messaging channels, such as disabled people or people enclosed in a special
space.


Technically, alarm delivery is implemented as follows [4]:


• Use SQL commands to retrieve the alarm events from the event database (eBase)

• Use an email system to send emails or SMS texts to mobiles

• Send the emails automatically using PHP, Python, Ruby, or another programming
language.
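
The first and last steps can be sketched together in Python. This is only an illustration under stated assumptions: the table name `events`, its columns, and the email addresses are hypothetical, an in-memory SQLite database stands in for the event database, and the message is built but not sent.

```python
import sqlite3
from email.message import EmailMessage


def fetch_alarm_events(conn):
    """Select the events flagged as alarms from a (hypothetical) events table."""
    rows = conn.execute(
        "SELECT id, description FROM events WHERE is_alarm = 1 ORDER BY id"
    )
    return rows.fetchall()


def build_alarm_email(event, sender, recipient):
    """Wrap one alarm event in an email message (not sent here)."""
    event_id, description = event
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Surveillance alarm #{event_id}"
    msg.set_content(f"Alarm event {event_id}: {description}")
    return msg


def demo():
    conn = sqlite3.connect(":memory:")  # stands in for the event database
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, description TEXT, is_alarm INTEGER)"
    )
    conn.executemany(
        "INSERT INTO events (description, is_alarm) VALUES (?, ?)",
        [("car crossed stop line on red", 1), ("pedestrian passing", 0)],
    )
    events = fetch_alarm_events(conn)
    conn.close()
    return [build_alarm_email(e, "alarms@example.com", "recipient@example.com")
            for e in events]
```

To actually deliver a message, each object returned by `demo()` could be passed to `smtplib.SMTP(host).send_message(msg)`, where `host` is the mail server of the deployment.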


For example, PHP can be used to send an email in the following way:

<?php
// Recipient, subject line, and body of the alarm email
$to = "recipient@example.com";
$subject = "Hi!";
$body = "Hi,\n\nHow are you?";
// mail() returns true if the message was accepted for delivery
if (mail($to, $subject, $body)) {
    echo("<p>Message successfully sent!</p>");
} else {
    echo("<p>Message delivery failed...</p>");
}
?>

Fig. 8.3 Detected line from a surveillance image (panels: original image, binary image, detected line)


8.3 Alarming for a Traffic Ticketing System 


In this section, we introduce a new implementation that detects traffic ticketing events 
from videos and triggers actions when a car passes through the solid white stop line 
located opposite the traffic light at the moment when it turns red. 

The first step is to detect the solid stop lines at the intersection. The location of
the stop line determines the event trigger condition. Detecting lines in an image is a
well-known problem that can be solved effectively using the Hough transform (HT) [13].
Owing to its robustness, HT is a convenient and powerful method for straight line
extraction. The equation for HT is defined as Eq. (8.27):


$$\rho = x \cdot \cos(\theta) + y \cdot \sin(\theta) \tag{8.27}$$


where the (x, y) coordinates are transformed to the $(\rho, \theta)$ space. Hence, finding
a line in (x, y) space is converted into searching for points in $(\rho, \theta)$ space.
Then, a voting stage ranks the peaks in the Hough space, where the peaks denote
the lines.
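
The voting stage can be illustrated with a few lines of plain Python. This is a minimal sketch rather than a production detector: the function names, the one-degree angular resolution, and the unit bin width for the distance parameter are assumptions, and the input is assumed to be a list of foreground pixel coordinates taken from a binary image.

```python
import math
from collections import Counter


def hough_lines(points, n_theta=180, rho_step=1.0):
    """Vote in (rho, theta) space for a list of (x, y) foreground pixels.

    Returns a Counter mapping (rho_bin, theta_index) to its vote count.
    """
    votes = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)  # Eq. (8.27)
            votes[(round(rho / rho_step), t)] += 1
    return votes


def strongest_line(points, n_theta=180, rho_step=1.0):
    """Return (rho, theta in degrees, votes) of the highest peak."""
    votes = hough_lines(points, n_theta, rho_step)
    (rho_bin, t), count = votes.most_common(1)[0]
    return rho_bin * rho_step, 180.0 * t / n_theta, count
```

For a horizontal row of pixels, such as a stop line seen from a front-facing camera, all points agree on a single cell with an angle of 90 degrees and a distance equal to the row index, so that cell collects one vote per pixel.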

After the location of the stop line has been obtained from the footage, foreground
detection is conducted automatically, because the video may contain slight camera
shake and other moving objects such as pedestrians.

In traffic ticketing management, the next crucial step is to search a database for the
vehicle owner by using the registration plate number.

The vehicle entity may consist of many attributes such as plate number, model 
number, year built, color, and status. The primary key of this table is its plate number. 
The registration information records how many vehicles are owned by a driver. The 
relationship between the owner entity and the vehicle entity is one-to-many, which 
means the owner can have more than one car. There are three tables: driver table, 
vehicle table, and registration table. 
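
The three tables and the owner lookup can be sketched with SQLite. The column names, the sample rows, and the use of a license number as the driver key are assumptions made for illustration; only the plate number as the primary key of the vehicle table and the one-to-many relationship come from the description above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE driver (
    license_no TEXT PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE vehicle (
    plate_no   TEXT PRIMARY KEY,
    model      TEXT,
    year_built INTEGER,
    color      TEXT,
    status     TEXT
);
-- One row per vehicle: a driver may appear many times, a vehicle only once.
CREATE TABLE registration (
    plate_no   TEXT PRIMARY KEY REFERENCES vehicle(plate_no),
    license_no TEXT NOT NULL REFERENCES driver(license_no)
);
"""


def build_demo_db():
    """Create an in-memory database with one driver owning two vehicles."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO driver VALUES ('D001', 'Alice')")
    conn.executemany(
        "INSERT INTO vehicle VALUES (?, ?, ?, ?, ?)",
        [("ABC123", "sedan", 2015, "white", "active"),
         ("XYZ789", "van", 2018, "blue", "active")],
    )
    conn.executemany(
        "INSERT INTO registration VALUES (?, ?)",
        [("ABC123", "D001"), ("XYZ789", "D001")],
    )
    return conn


def find_owner(conn, plate_no):
    """Return the owner's name for a plate number, or None if unregistered."""
    row = conn.execute(
        "SELECT d.name FROM registration r "
        "JOIN driver d ON d.license_no = r.license_no "
        "WHERE r.plate_no = ?",
        (plate_no,),
    ).fetchone()
    return row[0] if row else None
```

Making the plate number the primary key of the registration table is one way to enforce that each vehicle has a single owner while a driver can still own several vehicles.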

Figure 8.3 shows the detected line from a surveillance image. Figure 8.4 illustrates 
the line for surveillance alarming. Hough transform correctly identifies the stop line 
before the moving vehicle detection. The location of the stop line is stored and used 
later for event alarming. The condition for triggering the alarm is when a car passes 
the stop line [11]. 
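
One possible way to express the trigger condition is to test whether a tracked vehicle point changes sides of the stored line between two frames while the light is red. The sketch below is an assumption about how such a test could look, not the implementation used in the cited work; the function names and the idea of tracking a single reference point per vehicle are hypothetical.

```python
import math


def signed_distance(point, rho, theta):
    """Signed distance of (x, y) from the line rho = x*cos(theta) + y*sin(theta)."""
    x, y = point
    return x * math.cos(theta) + y * math.sin(theta) - rho


def crossed_on_red(prev_point, curr_point, rho, theta, light_is_red):
    """True if the point moved across the stop line while the light was red."""
    before = signed_distance(prev_point, rho, theta)
    after = signed_distance(curr_point, rho, theta)
    return light_is_red and before * after < 0
```

Because the line is kept in the same $(\rho, \theta)$ form that the Hough transform produces, the stored stop line can be reused directly without converting it to image endpoints.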

Fig. 8.4 Line for surveillance alarm making


8.4 Incident Responses 


Once a disaster occurs, we need to respond to it promptly [17]. The professional steps
include:


1. Preparation: Good strategies and tactics lead to good practices; thorough training
is one of the effective ways to prepare. Well-designed processes and operations are
double-checked at preparation time, and good preparation leads to operations
carried out in a logical sequence without rushing. For example, we use “drop,”
“cover,” and “hold” in earthquake practice. The drill is: drop to the floor, get under
cover, and hold on to furniture.

2. Identification: Once an alarm is given, its source should be tracked and the
alarm should be identified. This indispensable examination keeps the subsequent
work focused. Usually, alarms are unique and should be reported according to the
real situation.

3. Containment: This action makes sure that the alarm process is under control and
stays on track. The alarm should be monitored dynamically, any change in the
situation should be known, and the relevant operations and processes should be
well designed before containment begins.

4. Eradication: An alarm or risk is permanently removed after the incident. This
keeps the alarming process clear, especially when dealing with multiple alarms.
The best way to deal with alarms is one by one, in case one alarm mixes with
and triggers another.

5. Recovery: After the alarm has been handled, the scene should be restored so that
the layout is the same as before the alarm. An effective recovery helps the next
round of alarm identification and incident response start from a very clear
point.

6. Lessons learned: Learning from the incident or failure helps to avoid further
losses and to achieve better results. The experience accumulated in alarm
processing should be shared with colleagues so that peers are aware of the
standard way of processing alarms. An effective process leads to the success of
the team in alarm processing and incident response.


8.5 Questions 


Question (1) What is a decision tree? How can decision trees be used to make
surveillance alarms?

Question (2) What is a random forest? How can alarms be handled using a random
forest?

Question (3) What is the difference between a decision tree and a decision diagram?

Question (4) What is an expert system? How can it be used for making surveillance alarms?

Question (5) Why is policy important in surveillance alarming and incident response?


Surveillance Computing


In 1975, Dr. Gordon Moore, co-founder of Intel, revised his earlier observation to a
doubling every 2 years in the number of components per integrated circuit and
predicted that this rate of growth would continue for at least another decade. The
prediction, later called Moore’s law, has spurred the computer chip industry for
decades.


9.1 Challenges of Modern Computing 


Since 2005, we have experienced a bottleneck in computing: Moore’s law seems to be
coming to an end, chip speeds have stopped increasing, and transistors cannot be
squeezed into ever smaller spaces; hence, the computing power of a single CPU
cannot increase much further. Therefore, supercomputing and high-performance
computing are indeed needed [36, 46].

Since the dawn of the supercomputing era in the 1960s, supercomputing technology
has developed enormously. Within almost two decades, supercomputers have become
up to 200,000 times faster and have been used in many areas of development that
require millions of processors.


9.1.1 Parallel Computing 


Parallel computing is a form of computation in which many calculations are carried
out simultaneously, operating on the principle that large tasks can often be divided
into small ones, which are then solved concurrently (“in parallel”). There are different
forms of parallel computing: bit-level, instruction-level, and task parallelism.
Parallelism has been put into practice for decades in high-performance computing [36].
As power consumption has become a concern in recent years, parallel computing
has become the dominant paradigm in computer architecture, mainly in the form of
multicore processing.

Fig. 9.1 Serial-parallel programming

Parallel computers are roughly classified according to the level at which the hardware
supports parallelism: multicore and multiprocessor computers have multiple processing
elements within a single machine, while clusters and grids use multiple computers to
work on the same task. Specialized parallel architectures are utilized alongside
traditional processors for accelerating specific tasks [30,40].

Traditionally, computer software has been written in a serial way. The instructions
are executed on the central processing unit of one computer. Only one instruction
is executed at a time; after it has finished, the next one is executed. Parallel
computing, on the other hand, uses multiple processing elements
simultaneously. This is accomplished by breaking a task into independent parts so
that each processing element executes its part of the algorithm simultaneously with
the others. The processing elements are diverse and include resources such as a
single computer with multiple processors, several networked computers, specialized
hardware, or any combination of them.
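
The idea of breaking a task into independent parts can be sketched with Python's standard `concurrent.futures` module. The example is illustrative only: the function names and the choice of a sum of squares as the task are assumptions, and in CPython a thread pool mainly demonstrates the decomposition, since CPU-bound threads do not run truly simultaneously there.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut a list into at most `parts` contiguous chunks of similar size."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def serial_sum_of_squares(data):
    """Serial version: one instruction stream handles every element."""
    return sum(x * x for x in data)


def parallel_sum_of_squares(data, workers=4):
    """Parallel version: each worker handles one independent chunk."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(serial_sum_of_squares, chunks))
    return sum(partial)  # combine the partial results
```

Both functions return the same value; only the way the work is organized differs, which is the point of the serial-parallel comparison.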

Parallel programs are more difficult to write than sequential ones, since concurrency
introduces new classes of potential software bugs. Communication and synchronization
between the different subtasks are typically among the greatest obstacles to good
parallel program performance. Figure 9.1 shows a serial and a parallel model for
solving the parallel programming problem.

Concurrent programming languages, libraries, APIs, and parallel programming models
have been created for programming in parallel; they are grouped by the memory
architecture they utilize: shared memory, distributed memory, or shared distributed
memory [45].

OpenMP is one of the most broadly used shared memory APIs, whereas the Message
Passing Interface (MPI) is adopted for message passing, in which one part of a program
delivers a required datum to another part of the program. In the runtime library of
OpenMP, “vcomp.lib” and “vcompd.lib” provide multithread-based dynamic link functions
for C/C++ programming. A typical example of an OpenMP routine is,


#include <stdio.h>
#include <omp.h>

int main(void) {
    /* create a team of threads; each thread runs the block below */
    #pragma omp parallel
    {
        int ID = omp_get_thread_num(); /* number of the calling thread */
        printf(" hello(%d) ", ID);
        printf(" world(%d) \n", ID);
    }
    return 0;
}


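One possible output with two threads is,


“hello(1) hello(0) world(1) world(0)”

where omp_get_thread_num() is the function that returns the number of the calling
thread (the threads themselves are created by the parallel directive). The order of
the output may differ between runs because the threads execute concurrently. The
relevant functions also include omp_get_max_threads(), omp_get_num_threads(),
omp_set_num_threads(), etc.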


9.1.2 Multicore Processing 


A multicore processor is a single computing component with two or more inde- 
pendent central processing units (called “cores”), which read and execute program 
instructions. The CPU instructions are ordinary and basic ones such as “add,” “move,” 
and “branch,” but the multiple cores run the instructions at the same time and in- 
crease overall speed for programs amenable to parallel computing. Manufacturers 
typically integrate the cores onto a single integrated circuit die (known as a chip 
multiprocessor or CMP) or onto multiple dies in a single-chip package. 

Processors were originally developed with only one core. A multicore processor 
implements multiprocessing in a single physical package. Designers may couple 
cores in a multicore device tightly or loosely. For example, cores may or may not 
share caches, and may implement message passing or shared memory inter-core 
communication methods. Common network topologies to interconnect cores include 
bus, ring, two-dimensional mesh, and crossbar. Homogeneous multicore systems 
include only identical cores; in heterogeneous multicore systems, the cores are not identical. Just as with single-processor
systems, cores in multicore systems may implement architectures such as vector 
processing or multithreading. 

Multicore processors have been applied to networks, digital signal processing
(DSP), and graphics [25,40]. The improvement in performance gained by using a
multicore processor depends very much on the software algorithms and their
implementation. In particular, possible gains are limited by the fraction of the
software that can run in parallel. In the best case, so-called embarrassingly parallel
problems may achieve speedup factors near the number of cores, or even more if the
problem is split up enough to fit within each core’s cache, avoiding use of the much
slower main system memory. Most applications, however, are not accelerated so much
unless programmers invest a prohibitive amount of effort in refactoring the whole
problem. Figure 9.2 shows the configuration of a multicore system.

Fig. 9.2 Specifications of a multicore computer system
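
The limit imposed by the parallel fraction is commonly quantified with Amdahl's law, which is named here as a standard result rather than something stated in the text above. A minimal sketch, in which the function and parameter names are illustrative:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of a program runs in parallel.

    parallel_fraction: share of the run time that can be parallelized (0..1)
    cores: number of cores used for the parallel part (>= 1)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_part = 1.0 - parallel_fraction
    return 1.0 / (serial_part + parallel_fraction / cores)
```

For example, a program that is half serial gains at most a factor of 1.6 on four cores, while a fully parallel program gains a factor equal to the number of cores; the cache effect mentioned above is outside this simple model.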


9.1.3 Cluster Computing 


Cluster computing addresses the latest results in the fields that support high- 
performance distributed computing (HPDC). In HPDC environments, parallel and 
distributed computing techniques are applied to the solution of computationally in- 
tensive applications across networks of computers [57]. 

In a nutshell, network clustering connects independent computers together in a
coordinated fashion. Because clustering is a term used broadly, the hardware
configuration of clusters varies substantially depending on the networking
technologies chosen and the purpose.


9.1.4 Supercomputing 


Supercomputing was historically achieved by vector computers; today's machines are
parallel or parallel vector computers. A supercomputer is a computational machine at
the frontline of contemporary processing capacity, in particular with calculations
that happen at the speed of nanoseconds. While the supercomputers of the 1970s
adopted only a few processors, machines with thousands of processors began to appear
in the 1990s; by the end of the twentieth century, massively parallel supercomputers
with thousands of “off-the-shelf” processors were the norm. From 2013, China’s
Tianhe-II supercomputer was the fastest one in the world. In 2016, China’s Sunway
TaihuLight, installed by China National Supercomputing Center, was ranked as the
world’s fastest supercomputer on the Top500 list. In 2018, IBM Summit at Oak Ridge,
USA, took the leading position (122.3 PFLOPS).

The systems with massive numbers of processors generally take one of two paths. 
In one approach (e.g., in distributed computing), a large number of discrete computers 
(e.g., laptops) distributed across a network (e.g., the Internet) devote some or all of 
their time to solve a real problem; each individual computer (client) receives and 
completes small tasks, and reports the results to a central server which integrates the 
tasks from all the clients into the overall solution [40]. In another approach, a large 
number of dedicated processors are placed in close proximity to each other (e.g., 
in a computer cluster); this saves considerable time moving data around and makes 
it possible for the processors to work together (rather than on separate tasks), for 
example, in mesh and hypercube architectures [40]. 

Supercomputers play an important role in the field of computational science and 
are used for a great deal of computationally intensive tasks in various fields, including 
quantum mechanics, weather forecasting, climate research, oil and gas exploration, 
molecular modeling, and physical simulations. 

Supercomputing covers a number of activities and highlights: 


• Domain-specific tertiary-level qualifications: the domains are related to computer
science, geosciences, physics, chemistry, mathematics, and engineering.

• Parallel programming: this programming includes distributed-memory programming
and shared-variable programming.

• High-performance computing: the computing covers hybridization, performance
analysis, optimization, and scaling [36].

• Scientific computing: typically, scientific computing covers numerical
libraries and application-specific packages.

• Data management: large volumes of data from various areas need to be managed
with database design, data grids, distributed computing, and metadata [8, 10, 16].


9.1.5 High-Performance Computing 


High-performance computing (HPC) aims at solving scientific problems via
supercomputers and fast networks as well as visualization [36]. HPC makes use of
parallel processing for running advanced applications efficiently, reliably, and
quickly [30]. The term HPC is occasionally used as a synonym for supercomputing,
though technically a supercomputer is a system that performs at or near the currently
highest operational rate for computers.


9.1.6 Multithread Computing


A thread is a line or flow of control through a program. There are three models for
creating threads: many to one, one to one, or many to many; the choice depends on how
many processors will be applied to a program. Threads have been built into the
programming language Java. In Java, mutexes, condition variables, and thread join are
used for multithread synchronization. Meanwhile, the parallel platform OpenMP adopts
directives, library routines, and environment variables to carry out the same
work [36].
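
Two of the synchronization tools named above, the mutex and thread join, can be illustrated in a few lines. The sketch uses Python's `threading` module rather than Java for brevity; the function name and the shared counter are assumptions chosen for the example.

```python
import threading


def count_with_lock(n_threads=4, increments=10000):
    """Increment a shared counter from several threads, guarded by a mutex."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:          # only one thread may update at a time
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                # wait until every worker has finished
    return counter
```

The lock makes each read-modify-write of the counter atomic with respect to the other workers, and the joins guarantee that the result is read only after all updates are complete.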

Multithreading is the ability of a program to manage its use by more than one
user at a time, or multiple requests by the same user, without having multiple copies
of the program running in the computer. Central processing units (CPUs) have hardware
support to execute multiple threads efficiently [40]. This differs from
multiprocessing systems (such as multicore systems) in that the threads have to
share the resources of a single core: computing units, CPU caches, and the translation
lookaside buffer (TLB). Multithreading aims to increase the utilization of a single
core by using thread-level as well as instruction-level parallelism. As the two
techniques are complementary, they are sometimes combined in systems with multiple
multithreading CPUs and in CPUs with multiple multithreading cores. The advantages
include:


• If a thread gets a lot of cache misses, the other thread(s) can continue, taking
advantage of the unused computing resources, which can lead to faster overall
execution, as these resources would have been idle if only a single thread was
executed.

• If a thread cannot use all the computing resources of the CPU because instructions
depend on each other’s result, running another thread can avoid leaving these idle.

• If several threads work on the same set of data, they can actually share their cache,
which leads to better cache usage or synchronization [40].


Meanwhile, the criticisms of multithreading include:


• Multiple threads can interfere with each other when sharing hardware resources
such as caches or translation lookaside buffers [40].

• The execution time of a single thread is not improved and can even be degraded,
even when only one thread is being executed. This is due to lower frequencies and
additional pipeline stages that are necessary to accommodate thread-switching
hardware.

• Hardware support for multithreading is more visible to software, thus requiring
more changes to both applications and operating systems than multiprocessing.


In parallel computing, the Fork-Join model is a way of setting up and executing
parallel programs so that execution branches off in parallel at designated points
in the program, to “join” (that is, merge) at a subsequent point and resume sequential
execution. Parallel sections may fork recursively until a certain task granularity is
reached. The Fork-Join model can be considered as a parallel version of the divide and
conquer paradigm shown in Fig. 9.1.

The Fork-Join model is the main model of parallel execution in the OpenMP framework;
an OpenMP program begins with an initial thread. When any thread encounters a
parallel construct using the parallel keyword, a team of threads is forked. An implicit
barrier is used at the end of the parallel construct. Any number of parallel constructs
can be specified in a single program, and parallel constructs may be nested. Depending
on the implementation, a nested parallel structure may yield a team of threads or may
be completed by a single thread [40].
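
The recursive form of Fork-Join, with forking that stops at a chosen task granularity, can be sketched as follows. This is an illustration in Python threads, not OpenMP; the function name, the summation task, and the default threshold are assumptions.

```python
import threading


def fork_join_sum(data, threshold=1000):
    """Sum a list by recursive fork-join until chunks are small enough."""
    if len(data) <= max(threshold, 1):
        return sum(data)                 # granularity reached: run serially
    mid = len(data) // 2
    result = {}

    def left_task():
        result["left"] = fork_join_sum(data[:mid], threshold)

    child = threading.Thread(target=left_task)
    child.start()                        # fork: left half runs in a new thread
    right = fork_join_sum(data[mid:], threshold)
    child.join()                         # join: wait for the forked branch
    return result["left"] + right
```

The current thread keeps the right half for itself instead of forking twice, which halves the number of threads created; the join acts as the barrier after which sequential execution resumes.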


9.1.7 Graphics Processing Unit (GPU) 


A graphics processing unit (GPU), also occasionally called a visual processing unit
(VPU), is a specialized electronic circuit designed to rapidly manipulate and alter
memory to accelerate the creation of images in a frame buffer intended for output to
a display. Modern GPUs are very efficient at manipulating computer graphics and
at image processing; their highly parallel structure makes them more effective than
general purpose CPUs for algorithms in which large blocks of data are processed
in parallel. In a personal computer, a GPU can be present on a video or graphics
card, or it can be on the motherboard or in certain CPUs [49].

The term GPU was first popularized by Nvidia in 1999, referring to a single-chip
processor with integrated transform, lighting, triangle setup, clipping, and rendering
engines.

GPU programming is assisted by the libraries such as GPU-accelerated BLAS 
library, GPU-accelerated FFT library, GPU-accelerated sparse matrix library, and 
GPU-accelerated RNG library. Typically, the Basic Linear Algebra Subprograms 
(BLAS) library supports vector-vector operations (Level 1), matrix-vector operations 
(Level 2), and matrix-matrix operations (Level 3). The library has been applied to 
system solving, QR decomposition, SVD decomposition, eigenvalues, inverse, least 
squares, Markov chain Monte Carlo (MCMC), genetic algorithm (GA), etc. 

With the emergence of deep learning, the importance of GPUs has increased. 
While training deep learning neural networks, GPUs can be more than 250 times 
faster than CPUs. 


9.2 OpenCL Computing 
9.2.1 OpenCL Programming 


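When we use the Fork-Join model for arithmetic calculation, the matrix multiplication
$(c_{ij})_{M \times N} = (a_{ij})_{M \times K} \cdot (b_{ij})_{K \times N}$ is traditionally computed as,


$$c_{ij} = \sum_{k=1}^{K} a_{ik} \cdot b_{kj}, \quad i = 1, 2, \ldots, M;\ j = 1, 2, \ldots, N. \tag{9.1}$$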
k=1 

If we adopt the parallel computation, the multiplication will be,


$$c_{ij} = \sum_{k=1}^{K} a_{ik} \cdot b_{kj}, \quad j = 1, 2, \ldots, N. \tag{9.2}$$


The difference between Eqs. (9.1) and (9.2) is that the latter uses a parallel “for”
loop; the elements of row i are produced simultaneously as the multiplication
output.
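
The parallel "for" loop over the elements of one row can be imitated with a thread pool. This is a minimal sketch in Python rather than OpenCL; the function names are illustrative, and matrices are assumed to be given as lists of rows.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_matmul(a, b, workers=4):
    """Multiply a (M x K) by b (K x N); each row's elements are computed concurrently."""
    inner = len(b)
    n_cols = len(b[0])

    def element(i, j):
        # c_ij = sum over k of a_ik * b_kj
        return sum(a[i][k] * b[k][j] for k in range(inner))

    c = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i in range(len(a)):
            # parallel "for" over j: all elements of row i are submitted at once
            c.append(list(pool.map(lambda j, i=i: element(i, j), range(n_cols))))
    return c
```

The outer loop over the rows stays serial here to mirror the description above; on a real parallel platform both indices are usually distributed over the available processing elements.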

In scalar multiplication, the thread ID is applied to the calculation.
Previously it was,


$$c_i = a_i \cdot b_i, \quad i = 1, 2, \ldots, N. \tag{9.3}$$


Using parallel programming, now it is,

$$c_{id} = a_{id} \cdot b_{id}, \quad id = 1, 2, \ldots, N_{id}. \tag{9.4}$$


The difference is that the thread id has been applied to the scalar multiplication
in Eq. (9.4) instead of i in Eq. (9.3). The id is allocated automatically by the
parallel platform, which means that the computation is parallel. Analogously, the
famous increment operation “c(i)++;” in C++ becomes,


c(id)++; (9.5)


where id is the number of the thread or the device number in the parallel platform.
Based on the new definitions of matrix multiplication and scalar multiplication,
including the famous “++” count increasing operation, the traditional source code built
on serial programming in linear algebra [30,36], system solving, differential
equations, numerical analysis, probability and statistics, and image/video and digital
signal processing should be rewritten to cater for the computational needs of parallel
computing [25,49].

For instance, the CUDA Toolkit [33], with its C/C++ compiler, provides libraries for
GPU-accelerated parallel computations such as the Basic Linear Algebra Subprograms
(BLAS) library, the fast Fourier transform (FFT) library, the sparse matrix library,
and the random number generator (RNG) library.

Among them, BLAS library has three levels: vector-vector operations, matrix- 
vector operations, and matrix-matrix operations. The functions cover the computa- 
tions like solving a linear system [30,36], 


$$A \cdot x = b \tag{9.6}$$


where A is an $N \times N$ matrix and b is an N-dimensional vector. The solution is,


$$x = A^{-1} \cdot b \tag{9.7}$$

A matrix A could be decomposed by using QR decomposition,


$$A = Q \cdot R \tag{9.8}$$


where A is an $N \times N$ matrix, Q is an orthogonal matrix whose columns are orthogonal
unit vectors, meaning $Q^{T} \cdot Q = I$ and $Q^{T} = Q^{-1}$, and R is an upper triangular
matrix (also called a right triangular matrix).

Singular value decomposition (SVD) refers to


$$A = U \Lambda V^{*} \tag{9.9}$$


where $\Lambda$ is an $m \times n$ diagonal matrix with nonnegative real numbers on the
diagonal, $U$ is an $m \times m$ unitary matrix, and $V^{*}$ is the conjugate transpose of
the $n \times n$ unitary matrix $V$, with $V^{-1} = V^{T}$ in the real case.

Matrix eigenvalues $\lambda_i$ are found by,


$$f(\lambda) = \det(A - \lambda \cdot I) = 0 \tag{9.10}$$

Namely,


$$X^{-1} A X = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_N\} \tag{9.11}$$


where X is the eigenvector matrix.
Least squares is used for regression. For linear regression, given the data points
$(x_i, y_i)$, $i = 1, 2, \ldots, N$, we assume the fitting linear equation is,

$$f(x) = a x + b \tag{9.12}$$

and the squared error is,

$$\varepsilon = \sum_{i=1}^{N} (y_i - f(x_i))^2 = \sum_{i=1}^{N} (y_i - a x_i - b)^2 \tag{9.13}$$


Hence, the linear system for the linear regression is, 


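$$\begin{cases} \dfrac{\partial \varepsilon}{\partial a} = 0 \\[4pt] \dfrac{\partial \varepsilon}{\partial b} = 0 \end{cases} \tag{9.14}$$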


Solving this linear system [30,36], we can get a and b. 
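
Setting the two partial derivatives of the squared error to zero gives two linear equations in a and b, which have a closed-form solution. The sketch below solves them directly in plain Python; the function name is illustrative, and a GPU library would normally be used only for much larger systems.

```python
def fit_line(xs, ys):
    """Least-squares fit of f(x) = a*x + b to the points (xs[i], ys[i]).

    Solves the normal equations
        a*sum(x^2) + b*sum(x) = sum(x*y)
        a*sum(x)   + b*N      = sum(y)
    """
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    a = (n * sxy - sx * sy) / det
    b = (sy - a * sx) / n
    return a, b
```

For points that lie exactly on a line, such as (0, 1), (1, 3), (2, 5), (3, 7), the fit recovers the slope 2 and the intercept 1.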

The Basic Linear Algebra Subprograms (BLAS) are a specified set of low-level
subroutines that perform common linear algebra operations such as copying, vector
scaling, vector dot products, linear combinations, and matrix multiplication. They are
still used as a building block in higher-level mathematical programming languages
and libraries, including systems such as MATLAB, GNU Octave, Mathematica, NumPy,
and R.


9.2.2 Examples of OpenCL Programming


Tone mapping is a technique used in image processing and computer graphics to map 
one set of colors to another to approximate the appearance of high-dynamic-range 
images in a medium that has a more limited dynamic range. Tone mapping addresses 
the problem of strong contrast reduction from the scene radiance to the displayable 
range while preserving the image details and color appearance. 

The tone mapping sample demonstrates how to use high-dynamic-range (HDR) rendering
with a tone mapping effect in OpenCL; Fig. 9.3 illustrates straightforward linear tone
mapping and an advanced three-component curve-based tone mapping technique applied
to an HDR image.

God rays in atmospheric optics are rays of sunlight that appear to radiate from 
the point in the sky where the sun is located. These rays, which stream through 
gaps in clouds (particularly stratocumulus) or between other objects, are columns of 
sunlit air separated by darker cloud-shadowed regions. Despite seeming to converge 
at a point, the rays are in fact near-parallel shafts of sunlight, and their apparent 
convergence is a perspective effect. 

As shown in Fig. 9.4, the God rays sample demonstrates how to use high-dynamic-range
(HDR) rendering with the God rays (crepuscular rays) effect in OpenCL. This
implementation optimizes rendering passes by sharing intermediate data between
pixels during pixel processing, improves the method performance, and reduces data
loads.

Fig. 9.4 The example of God rays: (a) before the processing, (b) after the processing

Fig. 9.5 The example of median filter: (a) before filtering, (b) after filtering

Median filtering is one kind of smoothing technique, as is linear Gaussian filtering.
All smoothing techniques are effective at removing noise in smooth patches or
smooth regions of a signal, but adversely affect edges. Edges are of critical importance
to the visual appearance of images. For small to moderate levels of (Gaussian)
noise, the median filter is demonstrably better than Gaussian blur at removing noise
while preserving edges for a given, fixed window size. However, its performance is
not much better than that of Gaussian blur for high levels of noise, whereas for
speckle noise and salt-and-pepper noise, namely, impulsive noise, it is particularly
effective.

As shown in Fig. 9.5, the median filter sample demonstrates how to use a median filter
in OpenCL. This implementation optimizes the filtering process using implicit single
instruction multiple data (SIMD).
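
The filter itself is simple enough to write out without OpenCL. The following plain-Python sketch is a serial reference version, not the optimized sample: the function name, the square window, and the clipping of the window at the image border are assumptions, and the image is assumed to be a list of rows of gray values.

```python
from statistics import median_low


def median_filter(image, radius=1):
    """Replace each pixel by the median of its (2*radius+1)^2 neighborhood.

    Windows are clipped at the image border, so edge pixels use fewer samples.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            out[y][x] = median_low(window)  # always an existing pixel value
    return out
```

Applied to a flat patch with a single bright outlier, the filter removes the outlier completely, which is the behavior on impulsive noise described above. Each output pixel depends only on the input image, so the two loops are natural candidates for parallel execution.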


9.3 MATLAB Parallel Computing 


MATLAB provides parallel computing functionality, and built-in multithreading has been
automatically enabled in core MATLAB since R2008a. Concretely, MATLAB
provides the Parallel Computing Toolbox, which works with other toolboxes such as the
Optimization Toolbox and the Statistics Toolbox.

MATLAB provides parallel computing tools controlled by its users for a variety
of applications; it has the ability to leverage CPUs and GPUs to speed applications
up further.

Specifically, MATLAB has parallel for-loops (parfor) for running task-parallel
algorithms on multiple processors and supports CUDA-enabled NVIDIA GPUs [33]. It can
also run eight workers locally on a multicore desktop, supports computer clusters and
grids through MATLAB Distributed Computing Server, and can run parallel applications
interactively or in batch mode, especially for large dataset handling [16,17] and
data-parallel algorithms [6].

MATLAB implements parallel computing through:


• Worker: an independent MATLAB session that runs code distributed by the
client [40].

• Client: the MATLAB session that distributes jobs to workers.

• Parfor: a Parallel Computing Toolbox construct that distributes independent code
segments to workers.

• Random stream: a pseudorandom number generator and the sequence of values it
generates.

• Reproducible computation: a computation that can be exactly replicated even in
the presence of random numbers.


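A segment of source code for MATLAB programming is listed below as an
example:

numberOfRuns = 2000;
tic                                   % start the timer
parfor i = 1:numberOfRuns
    % each iteration is independent and writes only to its own row i
    dataParfor(i, :) = runSimBioModel();
end
elapsedTimeParallel = toc;            % elapsed time of the parallel loop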


9.4 Mobile Computing for Surveillance 


Mobile phones have become an integral part of our lives. Nowadays, they come
integrated with multimedia devices such as a camera, speaker, radio, and
microphone [38]. While primarily facilitating teleconversations, they also offer
additional services such as textual communication, games, audio/video playing, radio,
image/video capturing and transmission, alarm, calculator, and calendar [25,49].

More recently, mobile phones have been working like personal computers. A powerful
advance of mobiles is their ability to receive Wi-Fi signals and therefore
link to the world. The increasingly sophisticated Apps based on operating systems like
Android and iOS are accelerating the popularization of mobile phones.

Because of the broad application of 3G/4G/5G/6G communications, a mobile can
be used for voice communication and text anywhere at any time; its screen has
high resolution and is able to show real-time videos via broadband
communication [49]. Most smartphones at present can even access YouTube and run
software such as Skype.

If surveillance Apps based on mobile platforms are developed, we
could view surveillance data through this portable device [6]; if we could also control
the surveillance device, that would be even more valuable.

Multimedia simplification [53] is a transcoding technology that could be applied to
the multimedia messaging service (MMS) of mobile phones for audio and video
transmission. In video transmission, only the important frames or parts of a video
stream are sent to the receiver, which greatly saves the receiver’s time and
communication resources [49].
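
The idea of sending only the important frames can be sketched as follows. This is a minimal illustration in Python, not the method of [53]: it assumes that a frame is "important" when its mean absolute pixel difference from the last transmitted frame exceeds a threshold. The function names, the threshold, and the toy frames are all hypothetical.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def select_key_frames(frames, threshold=10.0):
    """Return indices of frames that differ enough from the last kept frame."""
    if not frames:
        return []
    kept = [0]  # always keep the first frame
    for index in range(1, len(frames)):
        if mean_abs_diff(frames[index], frames[kept[-1]]) > threshold:
            kept.append(index)
    return kept


# Toy "video": flattened 2x2 grayscale frames (values 0-255).
video = [
    [10, 10, 10, 10],
    [11, 10, 10, 9],     # almost identical to frame 0
    [200, 190, 10, 10],  # large change: an object enters
    [201, 191, 11, 10],  # almost identical to frame 2
    [10, 10, 10, 10],    # the object leaves
]
print(select_key_frames(video))  # [0, 2, 4]
```

Only three of the five frames would be transmitted in this toy case; a real system would compare decoded frames and would likely use a more robust change measure.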

When mobile computing meets deep learning [21], network compression [37] has been
recommended as a solution to reduce the workload caused by deep learning and
neural networks.

Neural networks are typically overparameterized; that is, there is significant
redundancy in deep learning models. Network pruning has been used both to reduce
network complexity and to reduce overfitting. This potentially makes deep neural
networks more energy efficient to run on mobile devices.
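
As a minimal sketch of network pruning, the Python example below zeroes out the weights with the smallest magnitudes. The magnitude criterion, the function name, and the example weights are assumptions made for illustration; the text above does not prescribe a particular pruning rule.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Hypothetical weights of one small layer.
layer = [0.91, -0.02, 0.05, -0.77, 0.01, 0.40, -0.03, 0.60]
sparse_layer = prune_by_magnitude(layer, sparsity=0.5)
print(sparse_layer)
```

With a sparsity of 0.5, the four smallest weights become zero and only the four largest connections remain; in practice the network is usually retrained afterwards to recover accuracy.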

A deep neural network is pruned by removing the redundant connections and keeping
only the most informative ones. The weights are then quantized so that multiple
connections share the same weight; thus, only the codebook (effective weights) and
the indices need to be stored. Finally, Huffman coding is employed to take
advantage of the biased distribution of effective weights.
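
The weight sharing and Huffman coding steps can be sketched as below. This is a simplified illustration in Python under stated assumptions: the shared weights are found with a one-dimensional k-means using linear initialization, and only the Huffman code lengths are computed. The function names and the example weights are hypothetical.

```python
import heapq
from collections import Counter


def share_weights(weights, n_clusters, iterations=20):
    """1-D k-means: return (codebook, indices); assumes n_clusters >= 2."""
    lo, hi = min(weights), max(weights)
    # Linear initialization of the centroids between min and max.
    codebook = [lo + (hi - lo) * k / (n_clusters - 1) for k in range(n_clusters)]

    def assign():
        return [min(range(n_clusters), key=lambda k: abs(w - codebook[k]))
                for w in weights]

    for _ in range(iterations):
        indices = assign()
        for k in range(n_clusters):
            members = [w for w, idx in zip(weights, indices) if idx == k]
            if members:  # keep the old centroid if a cluster is empty
                codebook[k] = sum(members) / len(members)
    return codebook, assign()


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} of a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap items: (frequency, unique tie breaker, {symbol: depth}).
    heap = [(freq, i, {sym: 0}) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical weights that survived pruning.
weights = [0.9, 0.88, -0.5, -0.52, 0.1, 0.12, 0.11, 0.09, 0.1, 0.13]
codebook, indices = share_weights(weights, n_clusters=3)
lengths = huffman_code_lengths(indices)
huffman_bits = sum(lengths[i] for i in indices)
fixed_bits = 2 * len(indices)  # 2 bits per index for 3 clusters
print(indices, huffman_bits, fixed_bits)
```

In this toy case the indices need 14 bits with Huffman coding instead of 20 bits with a fixed 2-bit code, because the most frequent shared weight receives the shortest code.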


9.5 Cloud Computing for Surveillance 


Cloud computing is a broad expression that includes delivering hosted services such
as computation, storage, and IO/network over the Internet [47,48]. Organizations
pay particular attention to cloud computing because it is on-demand, self-service,
location independent, elastic, and accessible over the network from everywhere
[50]. Cloud computing has the following features:


• Significant savings in hardware infrastructure

• Significant savings in technical resources required to maintain the system

• Green solution based on the shared infrastructure at the hosting company

• Easy to install

• Easy to modify after the installation.


A cloud environment consists of three core components:


• Software as a service (SaaS): in SaaS, the whole software or application runs on
a physical server and becomes available to consumers over the Internet.

• Platform as a service (PaaS): PaaS comprises software and product development
tools as well as programming languages which are leased by providers so that
clients can build and deploy their own applications for special use [40].

• Infrastructure as a service (IaaS): IaaS delivers computing services, storage, and
network, typically in the form of VMs running on hosts.

Cloud computing is closely related to the virtualization of applications [43,44].
In the first type of application virtualization, when clients log onto a server,
the server can issue different clients their own exclusive permissions according to
the corporate directories. In this way, the permissions of clients to download
software can be restricted. This also supports centralized management and the
deployment of a single image. Once the software has been matched and used, the
software preferences and profiles are saved in the clients’ cache memory to ensure
that the clients are able to access the software when they are offline. Any service
patches and updates of the software are applied to the application virtualization
server image; when the clients match the software next time, they can choose to
update the software and get the newer version. If clients are not sure about the
compatibility of the newer version with their operating systems [40], they can
revert to the older version, provided that the older version is still retained on
the application virtualization server. In this way, there is a list of available
software for the client, which exists as a graphical user interface (GUI)-based
window on the clients’ computers, and the clients are able to match and run any
software from the list at any time [40].

In the second type of application virtualization, the software is loaded remotely
as an image from the application virtualization server and runs on that server.
This kind of application virtualization is well known as secure local virtual
desktop images. The most significant advantage of the second type of application
virtualization is that it does not matter what operating system is on the clients’
machines, because the software is run on the server [40]. Another advantage of this
type of application virtualization is that it is more suitable for mobile devices,
like mobile phones and iPads, as these devices do not have enough processing power
to run processor-intensive applications, whereas a powerful server does.

The third type of application virtualization is presentation virtualization, which
represents a category of virtualization technology that abstracts the processing of
an application from the delivery of its graphics and I/O. With the application
installed in a single location, presentation virtualization enables its use by
multiple users. Each user connects to an individual session that sits atop a server
supporting presentation virtualization. Within that session, there are the
applications that have been provisioned for the user. In presentation
virtualization, the applications are run on a remote computer; however, the user
interface (UI) is transmitted over the network from the server to the thin client.

Cloud computing, as a storage-oriented system, has been growing rapidly in recent
years [61]. Video and image storage is challenging because of its substantial
infrastructure requirements: visual surveillance needs big data storage facilities
that may be costly to users. Also, once the storage disks are full or damaged, the
huge data will be at risk, and thus backup is absolutely needed. With timely data
backup, users can easily access the data at any time without worrying about cloud
facilities.

Video surveillance as a service (VSaaS) is primarily driven by pivotal factors such
as dynamic technology, cybersecurity, and remote access. A cloud-based visual
surveillance system (CVSS) allows users to benefit from lower upfront capital
costs. The surveillance system therefore lessens the resources and human workload.
As visual surveillance requires sufficient space to store the big surveillance
data, we think the first research problem is how to dynamically allocate enough
space to deposit these videos and images from surveillance sensors. We believe push
notification is an important feature of a cloud-based system.
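
To make the storage problem concrete, the Python sketch below first estimates the space needed for continuous recording and then grows a storage quota in fixed steps whenever usage exceeds a high-water mark. The camera count, bitrate, retention period, threshold, and step size are hypothetical, and the threshold policy is only one assumed way of allocating space dynamically.

```python
def storage_needed_gb(cameras, bitrate_mbps, retention_days):
    """Storage for continuous recording, in decimal gigabytes."""
    seconds = retention_days * 24 * 3600
    megabits = cameras * bitrate_mbps * seconds
    return megabits / 8 / 1000  # megabits -> megabytes -> gigabytes


def next_quota_gb(used_gb, quota_gb, high_water=0.8, step_gb=500):
    """Grow the quota in fixed steps until usage is below the high-water mark."""
    while used_gb > high_water * quota_gb:
        quota_gb += step_gb
    return quota_gb


# Assumed deployment: 16 cameras at 4 Mbit/s, recordings kept for 30 days.
need = storage_needed_gb(cameras=16, bitrate_mbps=4, retention_days=30)
print(need)                                         # 20736.0
print(next_quota_gb(used_gb=need, quota_gb=20000))  # 26000
```

Under these assumptions the recordings occupy about 20.7 TB, so a 20 TB quota would be raised to 26 TB to keep usage below 80%.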

If surveillance systems are based on cloud computing [2,4,5,11,41], a browser
is secure enough for surveillance users to access the relevant surveillance
content [35,39]; security staff could use the browser to view alarms and respond
to incidents, which will greatly reduce human workload [13,14,60], as shown in
Fig. 9.6. Traditional surveillance software such as human face recognition [55]
and pedestrian and vehicle detection [22] could be embedded into surveillance
systems as a cloud-based software service; the captured video footage, detected
objects, and recognized events could be archived in cloud-based databases
[16,19,31,32,54]. The huge storage space for surveillance could be dynamically
allocated on private and hybrid clouds [28,42,58,59]. Intelligent surveillance
could be offered as a service of cloud computing [23,24,26,29,34,56]. The push
messaging service could be applied to mobile phones and other terminal devices,
which is dramatically different from previous surveillance systems
[1,3,7,9,11,12,38].


9.6 Questions


Question 1. What is the Fork-Join model? What are its advantages and
disadvantages?

Question 2. Please list the differences between serial programming and parallel
programming.

Question 3. Which toolbox of MATLAB provides the features of parallel
computation?


Glossary


Data fusion The process of integrating multiple data and knowledge representing
the same real-world object into a consistent, accurate, and useful representation.

Ontology The nature of existence as well as the basic categories of objects and
their relations.

Metadata Data about data, namely additional information about a given set of data.

Camera calibration Correcting image distortions, namely the phenomenon that
straight lines in the image appear curved due to the surface irregularities of the
camera lens when images are obtained from sensors.

Event In textual topic detection and extraction, an event is something that happened
somewhere at a certain time.

Finite State Machine (FSM) A model of behavior composed of a finite number
of internal states (or simple states), the transitions between these states, the set of
events that occur, and actions performed.

Telic event An event that has an end point of its time interval; an atelic event,
in contrast, has no end point.

Atomic event An elementary event which cannot be divided into further events; it
is the simplest type of event.

Composite event An event defined by the composition of two or more atomic events;
it is a complex event.

Bayesian network A graphical model for representing conditional independencies
between a set of random variables.

Dynamic Bayesian network A Bayesian network that represents sequences of
variables.

Deep learning Learning with a powerful ability of nonlinear processing, using a
cascade of multiple layers for feature transformation and end-to-end learning.